I am learning the 're' module in Python, and the named pattern (?P=name) confused me.
When I use re.sub() to swap the digits and the characters, the pattern '(?P=name)' doesn't work, but the patterns '\N' and '\g<name>' work fine. Code below:
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'(?P=char)-(?P=digit)', '123-abcd'))
[OUT] (?P=char)-(?P=digit)
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\2-\1', '123-abcd'))
[OUT] abcd-123
[IN] print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\g<char>-\g<digit>', '123-abcd'))
[OUT] abcd-123
Why does the substitution fail when I use (?P=name)?
And how do I use it correctly?
I am using Python 3.5.
The (?P=name) is an inline (in-pattern) backreference. You may use it inside a regular expression pattern to match the same content as is captured by the corresponding named capturing group, see the Python Regular Expression Syntax reference:
(?P=name)
A backreference to a named group; it matches whatever text was matched by the earlier group named name.
See this demo: (?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit) matches 123-abcd&abcd-123 because the "digit" group matches and captures 123, "char" group captures abcd and then the named inline backreferences match abcd and 123.
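As a minimal sketch of the demo above (the variable names are mine, not from the question), the inline backreferences only match when the text after & repeats what the two groups captured:

```python
import re

# Inline backreferences: (?P=char) and (?P=digit) must repeat the captured text.
pattern = re.compile(r'(?P<digit>\d{3})-(?P<char>\w{4})&(?P=char)-(?P=digit)')

match = pattern.fullmatch('123-abcd&abcd-123')
print(match.group('digit'), match.group('char'))  # 123 abcd

# Here the tail does not repeat the captured groups, so there is no match.
no_match = pattern.fullmatch('123-abcd&wxyz-999')
print(no_match)  # None
```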
To replace matches, use the \1, \g<1> or \g<char> syntax in the re.sub replacement pattern. Do not use (?P=name) for that purpose:
repl can be a string or a function... Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern... In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
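A short sketch of the replacement forms described in that excerpt, reusing the pattern from the question. The variable names and the bracket-wrapping example are only illustrative:

```python
import re

pattern = r'(?P<digit>\d{3})-(?P<char>\w{4})'
text = '123-abcd'

# Three equivalent ways to swap the two groups in the replacement string.
by_number = re.sub(pattern, r'\2-\1', text)
by_g_number = re.sub(pattern, r'\g<2>-\g<1>', text)
by_g_name = re.sub(pattern, r'\g<char>-\g<digit>', text)
print(by_number, by_g_number, by_g_name)  # abcd-123 abcd-123 abcd-123

# \g<0> stands for the whole match.
wrapped = re.sub(pattern, r'[\g<0>]', text)
print(wrapped)  # [123-abcd]

# repl may also be a function that receives the match object.
swapped = re.sub(pattern, lambda m: m.group('char') + '-' + m.group('digit'), text)
print(swapped)  # abcd-123
```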
You can check the details of named groups and how to back-reference them at:
https://docs.python.org/3/library/re.html
Use Ctrl+F in your browser to search for (?P<name>...). There is a handy table there showing where each form of reference to a named group, including (?P=name), can be used.
For this example, your third re.sub() call is the right way to do it.
In any re.sub() call, the (?P=name) syntax only works in the first string parameter (the pattern). You don't need it in the second string parameter (the replacement) because the \g<name> syntax is available there.
In case you are wondering whether (?P=name) is useful at all: it is, but for matching, by back-referencing text that a named group has already captured.
Example: you want to match potatoXXXpotato and replace it with YYXXXYY. You could write:
re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato')
or
re.sub(r'(?P<myName>potato)(?P<triple>XXX)(?P=myName)', r'YY\g<triple>YY', 'potatoXXXpotato')
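Run as-is, both calls should give the same result. The sketch below also adds a made-up counter-example (potatoXXXtomato) to show that the inline backreference really constrains the match:

```python
import re

numbered = re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato')
named = re.sub(r'(?P<myName>potato)(?P<triple>XXX)(?P=myName)', r'YY\g<triple>YY', 'potatoXXXpotato')
print(numbered, named)  # YYXXXYY YYXXXYY

# (?P=myName) must repeat 'potato'; 'tomato' does not, so nothing is replaced.
unchanged = re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXtomato')
print(unchanged)  # potatoXXXtomato
```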
How can I replace foobar with foo123bar?
This doesn't work:
>>> re.sub(r'(foo)', r'\1123', 'foobar')
'J3bar'
This works:
>>> re.sub(r'(foo)', r'\1hi', 'foobar')
'foohibar'
I think it's a common issue when having something like \number. Can anyone give me a hint on how to handle this?
The answer is:
re.sub(r'(foo)', r'\g<1>123', 'foobar')
Relevant excerpt from the docs:
In addition to character escapes and
backreferences as described above,
\g<name> will use the substring
matched by the group named name, as
defined by the (?P<name>...) syntax.
\g<number> uses the corresponding
group number; \g<2> is therefore
equivalent to \2, but isn’t ambiguous
in a replacement such as \g<2>0. \20
would be interpreted as a reference to
group 20, not a reference to group 2
followed by the literal character '0'.
The backreference \g<0> substitutes in
the entire substring matched by the
RE.
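A small sketch of the ambiguity, assuming Python 3. The 'J3bar' result is consistent with \112 being read as an octal escape (chr(0o112) is 'J') followed by a literal 3. The two-group pattern (a)(b) is made up for illustration, and \20 should be rejected for it because there is no group 20:

```python
import re

# \g<1> ends the group reference explicitly, so the digits after it are literal.
fixed = re.sub(r'(foo)', r'\g<1>123', 'foobar')
print(fixed)  # foo123bar

# The docs' \g<2>0 case: group 2 followed by a literal '0'.
two_then_zero = re.sub(r'(a)(b)', r'\g<2>0', 'ab')
print(two_then_zero)  # b0

# \20 asks for group 20, which this two-group pattern does not have.
try:
    re.sub(r'(a)(b)', r'\20', 'ab')
    raised = False
except re.error:
    raised = True
print(raised)
```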
In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub(r'([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.
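To see why the raw string matters, here is a minimal sketch comparing the three spellings of the replacement (variable names are illustrative):

```python
import re

# Without the r prefix, '\1' is the single control character chr(1).
print(len('\1'), len(r'\1'))  # 1 2
print('\1' == '\x01')         # True

plain = re.sub(r'[a-z]*(\d+)', 'lion\1', 'zebra432')     # inserts chr(1)
raw = re.sub(r'[a-z]*(\d+)', r'lion\1', 'zebra432')      # inserts group 1
escaped = re.sub(r'[a-z]*(\d+)', 'lion\\1', 'zebra432')  # same as the raw string

print(repr(plain))   # 'lion\x01'
print(raw, escaped)  # lion432 lion432
```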
I am new to Python and Python RE.
I am trying to make a parser for ARM assembly code. I want to make regular expression for matching conditional branch instructions which could be:
beq, bne, blt, bgt
I tried a regular expression of the form
'b[eq|ne|lt|gt]'
But this does not match. Can someone please help me with this?
You should be using parentheses for options, not square brackets:
b(eq|ne|lt|gt)
And you'd usually want a non-capture group:
b(?:eq|ne|lt|gt)
And you can also make it a little more optimised too:
b(?:eq|ne|[lg]t)
Square brackets will be understood as being any of the characters or range of characters. So [eq|ne|lt|gt] effectively means either one of e, q, |, n, e (again, so it becomes redundant), etc.
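The difference is easy to check. This sketch uses fullmatch and a made-up list of test strings ('bl', 'b|' and 'bx' are only there to expose the character-class behaviour):

```python
import re

mnemonics = ['beq', 'bne', 'blt', 'bgt', 'bl', 'b|', 'bx']

char_class = re.compile(r'b[eq|ne|lt|gt]')     # 'b' + ONE character from the set
alternation = re.compile(r'b(?:eq|ne|[lg]t)')  # 'b' + one of the two-letter suffixes

class_hits = [m for m in mnemonics if char_class.fullmatch(m)]
alt_hits = [m for m in mnemonics if alternation.fullmatch(m)]

print(class_hits)  # ['bl', 'b|']
print(alt_hits)    # ['beq', 'bne', 'blt', 'bgt']
```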
Try the following pattern: b(?:eq|ne|lt|gt)
[] Character set: matches exactly one character from those inside the brackets. You can specify a range of characters with the metacharacter -, e.g. [a-e], or negate the set by putting the metacharacter ^ first, e.g. [^aeiou].
() Capturing parentheses: used for grouping part of the pattern and for creating a numbered capturing group. You can disable capturing by putting ?: right after the opening parenthesis, e.g. (?:...).
As mentioned above, you need parentheses with alternation to match one of several multi-character options; that is why your pattern using brackets did not match your string.
Please note that non-capturing parentheses do not save the matched text; remove the ?: if you want to capture the group.
Python's re module supports Perl-style regular expressions, so you can use named capturing groups as well as numbered backreferences. The main advantage of named groups is that they keep your expression easy to read, edit, and maintain.
E.g.:
(?P<opcode>b(?:eq|ne|lt|gt)) - captures the text matched by b(?:eq|ne|lt|gt) into the group named opcode
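A possible use of the named group in a parser, as a sketch only: the target group and the sample assembly line are hypothetical additions, not part of the question.

```python
import re

# 'opcode' comes from the answer above; 'target' and the sample line are made up.
branch = re.compile(r'(?P<opcode>b(?:eq|ne|lt|gt))\s+(?P<target>\w+)')

m = branch.search('    beq loop_start')
print(m.group('opcode'), m.group('target'))  # beq loop_start
print(m.groupdict())  # {'opcode': 'beq', 'target': 'loop_start'}
```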
I have a previously matched pattern such as:
<a href="somelink here something">
Now I wish to extract only the value of a specific attribute (or attributes) in the tag, but the value may be anything and the attribute may occur anywhere in the tag.
regex_pattern=re.compile('href=\"(.*?)\"')
Now I can use the above to match the attribute and the value part, but I need to extract only the (.*?) part (the value).
I can of course strip href=" and " later, but I'm sure I can use regex properly to extract only the required part.
In simple words I want to match
abcdef=\"______________________\"
in the pattern but want only the
____________________
Part
How do I do this?
Just use re.search('href=\"(.*?)\"', yourtext).group(1) on the matched string yourtext and it will yield the matched group.
Take a look at the .group() method on regular expression MatchObject results.
Your regular expression has an explicit capturing group (the part in () parentheses), and the .group() method gives you direct access to the string that was matched within that group. MatchObject instances are produced by several re functions and methods, including .search() and .finditer().
Demonstration:
>>> import re
>>> example = '<a href="somelink here something">'
>>> regex_pattern=re.compile('href=\"(.*?)\"')
>>> regex_pattern.search(example)
<_sre.SRE_Match object at 0x1098a2b70>
>>> regex_pattern.search(example).group(1)
'somelink here something'
From the Regular Expression syntax documentation on the (...) parenthesis syntax:
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
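Putting this together, a minimal sketch that guards against a failed search and also shows findall(). The second input string with two links is made up for illustration:

```python
import re

example = '<a href="somelink here something">'
regex_pattern = re.compile(r'href="(.*?)"')

match = regex_pattern.search(example)
if match:                  # search() returns None when nothing matches
    print(match.group(0))  # href="somelink here something"
    print(match.group(1))  # somelink here something

# findall() returns only the captured group when the pattern has exactly one.
links = regex_pattern.findall('<a href="a.html"> <a href="b.html">')
print(links)  # ['a.html', 'b.html']
```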