python re.sub group: number after \number - python

How can I replace foobar with foo123bar?
This doesn't work:
>>> re.sub(r'(foo)', r'\1123', 'foobar')
'J3bar'
This works:
>>> re.sub(r'(foo)', r'\1hi', 'foobar')
'foohibar'
I think it's a common issue when having something like \number. Can anyone give me a hint on how to handle this?

The answer is:
re.sub(r'(foo)', r'\g<1>123', 'foobar')
Relevant excerpt from the docs:
In addition to character escapes and
backreferences as described above,
\g<name> will use the substring
matched by the group named name, as
defined by the (?P<name>...) syntax.
\g<number> uses the corresponding
group number; \g<2> is therefore
equivalent to \2, but isn’t ambiguous
in a replacement such as \g<2>0. \20
would be interpreted as a reference to
group 20, not a reference to group 2
followed by the literal character '0'.
The backreference \g<0> substitutes in
the entire substring matched by the
RE.
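A minimal sketch of the unambiguous forms described in the excerpt, using only the standard re module. The variable names and the group name "word" are illustrative and not part of the original answer.

```python
import re

# \g<1> ends the group reference explicitly, so the digits after it are literal text.
explicit = re.sub(r'(foo)', r'\g<1>123', 'foobar')
print(explicit)  # foo123bar

# A named group can be referenced the same way.
named = re.sub(r'(?P<word>foo)', r'\g<word>123', 'foobar')
print(named)  # foo123bar

# \g<0> refers to the entire match, so no capturing group is needed here.
whole = re.sub(r'foo', r'\g<0>123', 'foobar')
print(whole)  # foo123bar
```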

Related

Named backreference (?P=name) issue in Python re

I am learning the 're' module of Python, and the named pattern (?P=name) confused me.
When I use re.sub() to swap the digits and the characters, the pattern '(?P=name)' doesn't work, but the patterns '\N' and '\g<name>' do. Code below:
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'(?P=char)-(?P=digit)', '123-abcd'))
[OUT] (?P=char)-(?P=digit)
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\2-\1', '123-abcd'))
[OUT] abcd-123
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\g<char>-\g<digit>', '123-abcd'))
[OUT] abcd-123
Why does the substitution fail when I use (?P=name)?
And how do I use it correctly?
I am using Python 3.5.
The (?P=name) is an inline (in-pattern) backreference. You may use it inside a regular expression pattern to match the same content as is captured by the corresponding named capturing group, see the Python Regular Expression Syntax reference:
(?P=name)
A backreference to a named group; it matches whatever text was matched by the earlier group named name.
See this demo: (?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit) matches 123-abcd&abcd-123 because the "digit" group matches and captures 123, "char" group captures abcd and then the named inline backreferences match abcd and 123.
To replace matches, use \1, \g<1> or \g<char> syntax with re.sub replacement pattern. Do not use (?P=name) for that purpose:
repl can be a string or a function... Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern... In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
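As a rough illustration of the in-pattern use, the sketch below runs the pattern from the demo above against one matching and one non-matching string. The second test string is made up for contrast.

```python
import re

pattern = re.compile(r'(?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit)')

# Matches: the text after '&' repeats the captured groups in swapped order.
hit = pattern.fullmatch('123-abcd&abcd-123')
print(hit.group('digit'), hit.group('char'))  # 123 abcd

# No match: 'abce' differs from what the "char" group captured.
miss = pattern.fullmatch('123-abcd&abce-123')
print(miss)  # None
```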
You can check the details of using and back-referencing (?P...) in the documentation:
https://docs.python.org/3/library/re.html
Use CTRL+F in your browser to search for (?P...). There is a helpful table showing the contexts in which you can use (?P=name).
In this example, your third re.sub() call is the right approach.
In any re.sub() call, the (?P=name) syntax is only valid in the first string argument (the pattern). You don't need it in the second string argument (the replacement) because the \g<name> syntax is available there.
In case you're wondering whether (?P=name) is useful: it is, but for matching by back-referencing text already captured by a named group.
Example: you want to match potatoXXXpotato and replace it with YYXXXYY. You could write:
re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato')
or
re.sub(r'(?P<myName>potato)(?P<triple>XXX)(?P=myName)', r'YY\g<triple>YY', 'potatoXXXpotato')
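The documentation quoted earlier notes that repl can also be a function. A minimal sketch of that alternative, applied to the question's digit/char swap, might look like this. The helper name swap is hypothetical.

```python
import re

def swap(match):
    # Hypothetical helper: read the named groups directly off the match object.
    return match.group('char') + '-' + match.group('digit')

pattern = re.compile(r'(?P<digit>\d{3})-(?P<char>\w{4})')
swapped = pattern.sub(swap, '123-abcd')
print(swapped)  # abcd-123
```

Because the replacement is built in ordinary Python code, there is no backreference syntax to get wrong, at the cost of a few extra lines.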

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>> # You can also refer to more than one capture group
>>> re.sub(r'([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
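To make the raw-string point concrete, here is a small sketch comparing the two spellings of the replacement. The variable names are illustrative.

```python
import re

# Without the r prefix, '\1' is the single control character chr(1), not a backreference.
print('\1' == chr(1))  # True
print(len(r'\1'))      # 2 (a backslash followed by the digit 1)

plain = re.sub(r'[a-z]*(\d+)', 'lion\1', 'zebra432')
raw = re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
print(repr(plain))  # 'lion\x01'
print(repr(raw))    # 'lion432'
```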

Python regex - (\w+) results different output when used with complex expression

I have a question about Python regex behavior. Here is my sample test:
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x7f51c0033210>
>>> re.match(r'(\w+):(\d+)', 'a-b:1')
>>>
Why doesn't the second regex return a match object when the first one does for a plain string match, regardless of whether the string contains special characters?
As I understand it, \w+ matches [a-z,A-Z,_], so I'm not clear why (\w+) returns a match object for the string 'a-b'. How can I check whether a given string doesn't contain any special characters?
Taking a look at the actual match will give you an idea of what happens.
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x0000000002DE45D0>
>>> _.groups()
('a',)
As you can see, the expression matched a. The character class \w only matches word characters (letters, digits, and the underscore), not separators like dashes. So you can't actually match a-b using just \w+.
Now in the second expression one might think that it would match b:1 at least, given that \w+ matches b and :(\d+) does match the 1. However it does not happen due to how re.match works. As the documentation hints, it only tries to match “at the beginning of string”. So when using re.match there is an implicit ^ at the beginning of the expression that makes it only match from the start. So it actually tries to find a match starting with a.
Instead, you can use re.search which actually looks in the whole string if it can match the expression anywhere. So there, you will get a result:
>>> re.search(r'(\w+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('b', '1')
For further information on the search vs. match topic, check this section in the manual.
And finally, if you want to match dashes too, you can use a character class such as [\w-]:
>>> re.match(r'([\w-]+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('a-b', '1')
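The difference between the functions can be sketched side by side. This is an illustration under the assumption of Python 3.4 or later (re.fullmatch does not exist in earlier versions); fullmatch is offered here as one possible way to check that a string contains only word characters, not as the answer's own method.

```python
import re

text = 'a-b:1'
pattern = r'(\w+):(\d+)'

at_start = re.match(pattern, text)   # anchored at position 0, so it fails at the '-'
anywhere = re.search(pattern, text)  # scans forward and finds 'b:1'
print(at_start)           # None
print(anywhere.groups())  # ('b', '1')

# fullmatch requires the entire string to match the pattern.
print(re.fullmatch(r'\w+', 'a-b') is None)  # True
print(re.fullmatch(r'\w+', 'a_b') is None)  # False
```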
The first pattern matches the a: one or more word characters.
The second requires one or more word characters immediately followed by a :, which the string doesn't have at its start.
[a-z,A-Z,_] means a to z, A to Z, the comma, and the underscore; the hyphens there denote ranges, not a literal hyphen. (\w itself is closer to [a-zA-Z0-9_].) If you want a literal hyphen, put it as the first or last character of the character class.
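A short sketch of how hyphens and commas behave inside a character class. The names loose and with_hyphen are illustrative.

```python
import re

# Inside [...], a-z and A-Z are ranges; the commas are literal commas.
loose = re.compile(r'[a-z,A-Z,_]+')
print(bool(loose.fullmatch('a,b')))  # True
print(bool(loose.fullmatch('a-b')))  # False

# A hyphen placed last in the class is a literal hyphen.
with_hyphen = re.compile(r'[\w-]+')
print(bool(with_hyphen.fullmatch('a-b')))  # True
```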
Match's docs say
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding MatchObject
instance.
The match method returns a match object if it finds a match at the beginning of the string. (\w+) matches a in a-b.
print(re.match(r'(\w+)', 'a-b').group())
will print
a
In the second case ((\w+):(\d+)), the only substring that could match is b:1, which is not at the beginning of the string. That's why it returns None.
How can I check whether the given string doesn't contain any special characters?
I would say the second regular expression you used, together with the match function, should be enough. I recommend match because match and search behave differently: http://docs.python.org/2.7/library/re.html#search-vs-match
Remember, you

Can't find the correct regex syntax to match newline or end of string

This feels like a really simple question, but I can't find the answer anywhere.
(Notes: I'm using Python, but this shouldn't matter.)
Say I have the following string:
s = "foo\nbar\nfood\nfoo"
I am simply trying to find a regex that will match both instances of "foo", but not "food", based on the fact that the "foo" in "food" is not immediately followed by either a newline or the end of the string.
This is perhaps an overly complicated way to express my question, but it gives something concrete to work with.
Here are some of the things I have tried, with results (Note: the result I want is [foo\n, foo]):
foo[\n\Z] => ['foo\n']
foo(\n\Z) => ['\n', ''] <= This seems to match the newline and EOS, but not the foo
foo($|\n) => ['\n', '']
(foo)($|\n) => [(foo,'\n'), (foo,'')] <= Almost there, and this is a useable plan B, but I would like to find the perfect solution.
The only thing I found that does work is:
foo$|foo\n => ['foo\n', 'foo']
This is fine for such a simple example, but it is easy to see how it could become unwieldy with a much larger expression (and yes, this foo thing is a stand in for the larger expression I am actually using).
Interesting aside: The closest SO question I could find to my problem was this one: In regex, match either the end of the string or a specific character
Here, I could simply substitute \n for my 'specific character'. Now, the accepted answer uses the regex /(&|\?)list=.*?(&|$)/. I notice that the OP was using JavaScript (question was tagged with the javascript tag), so maybe the JavaScript regex interpreter is different, but when I use the exact strings given in the question with the above regex in Python, I get bad results:
>>> findall("(&|\?)list=.*?(&|$)", "index.php?test=1&list=UL")
[('&', '')]
>>> findall("(&|\?)list=.*?(&|$)", "index.php?list=UL&more=1")
[('?', '&')]
So, I'm stumped.
>>> import re
>>> re.findall(r'foo(?:$|\n)', "foo\nbar\nfood\nfoo")
['foo\n', 'foo']
(?:...) makes a non-capturing group.
This works because (from re module reference):
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
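The effect of groups on the return value of findall can be sketched with the question's own test string. The variable names are illustrative.

```python
import re

s = "foo\nbar\nfood\nfoo"

# One capturing group: findall returns only what the group captured.
captured = re.findall(r'foo($|\n)', s)
print(captured)  # ['\n', '']

# Two capturing groups: findall returns one tuple per match.
pairs = re.findall(r'(foo)($|\n)', s)
print(pairs)  # [('foo', '\n'), ('foo', '')]

# No capturing groups: findall returns the full text of each match.
full = re.findall(r'foo(?:$|\n)', s)
print(full)  # ['foo\n', 'foo']
```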
You could use re.MULTILINE and include an optional linebreak after the $ in your pattern:
import re
s = "foo\nbar\nfood\nfoo"
pattern = re.compile(r'foo$\n?', re.MULTILINE)
print(re.findall(pattern, s))
# -> ['foo\n', 'foo']
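For comparison, a brief sketch of what re.MULTILINE changes about $ on the same string:

```python
import re

s = "foo\nbar\nfood\nfoo"

# Default: $ matches only at the end of the string (or just before a final newline).
default = re.findall(r'foo$', s)
print(default)  # ['foo']

# re.MULTILINE: $ also matches just before each newline.
multiline = re.findall(r'foo$', s, re.MULTILINE)
print(multiline)  # ['foo', 'foo']
```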
If you're only concerned with foo:
In [42]: import re
In [43]: strs="foo\nbar\nfood\nfoo"
In [44]: re.findall(r'\bfoo\b',strs)
Out[44]: ['foo', 'foo']
\b denotes a word boundary:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, so the
precise set of characters deemed to be alphanumeric depends on the
values of the UNICODE and LOCALE flags. For example, r'\bfoo\b'
matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or
'foo3'. Inside a character range, \b represents the backspace
character, for compatibility with Python’s string literals.
(Source)
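The examples listed in the quoted documentation can be checked directly. This is only a sketch; the list names are illustrative.

```python
import re

word = re.compile(r'\bfoo\b')

matching = ['foo', 'foo.', '(foo)', 'bar foo baz']
non_matching = ['foobar', 'foo3']

print([bool(word.search(t)) for t in matching])      # [True, True, True, True]
print([bool(word.search(t)) for t in non_matching])  # [False, False]

# In a non-raw string literal, '\b' is the backspace character, not a word boundary.
print('\b' == chr(8))  # True
```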
