Consider the text below:
foobar¬
nextline
The regex (.*?(?: *?\n)) matches foobar¬
where ¬ denotes a newline \n.
Why does the regex match it? Shouldn't the non-capturing group exclude it?
Tested on Regex101 with the Python dialect.
“Non-capturing group” refers to the fact that matches within that group will not be available as separate groups in the resulting match object. For example:
>>> re.search('(foo)(bar)', 'foobarbaz').groups()
('foo', 'bar')
>>> re.search('(foo)(?:bar)', 'foobarbaz').groups()
('foo',)
However, everything that is part of an expression is matched and as such appears in the resulting match (Group 0 shows the whole match):
>>> re.search('(foo)(bar)', 'foobarbaz').group(0)
'foobar'
>>> re.search('(foo)(?:bar)', 'foobarbaz').group(0)
'foobar'
If you don’t want to match that part but still want to make sure it’s there, you can use a lookahead expression:
>>> re.search('(foo)(?=bar)', 'foobarbaz')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> print(re.search('(foo)(?=bar)', 'foobaz'))
None
So in your case, you could use (.*?(?= *?\n)).
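As a minimal sketch of that suggestion, the two patterns can be compared side by side. The variable names `s`, `consumed`, and `peeked` are illustrative and not part of the original question:

```python
import re

# Sample text from the question; the name `s` is chosen for illustration.
s = 'foobar\nnextline'

# Non-capturing group: the trailing newline is still consumed by the match.
consumed = re.search(r'(.*?(?: *?\n))', s)

# Lookahead: a newline must follow, but it is not part of the match.
peeked = re.search(r'(.*?(?= *?\n))', s)

print(repr(consumed.group(1)))  # 'foobar\n'
print(repr(peeked.group(1)))    # 'foobar'
```

With the lookahead, both group 0 and group 1 stop before the newline.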
The \n is captured because the non-capturing group is inside the capturing group:
>>> s = 'foobar\nnextline'
>>> re.search(r'(.*?(?: *?\n))', s).groups()
('foobar\n',)
If you don't want that, place the non-capturing group outside of the capturing one:
>>> re.search(r'(.*?)(?: *?\n)', s).groups()
('foobar',)
I have the following Python regex:
>>> p = re.compile(r"(\b\w+)\s+\1")
\b : word boundary
\w+ : one or more word characters (letters, digits, or underscore)
\s+ : one or more whitespace characters (can be a space, \t, \n, ...)
\1 : backreference to group 1 ( = the part between (..))
This regex should find all doubled occurrences of a word, provided the two occurrences are next to each other with only whitespace in between.
The regex seems to work fine when using the search function:
>>> p.search("I am in the the car.")
<_sre.SRE_Match object; span=(8, 15), match='the the'>
The found match is the the, just as I had expected. The weird behaviour is in the findall function:
>>> p.findall("I am in the the car.")
['the']
The found match is now only the. Why the difference?
When using groups in a regular expression, findall() returns only the groups; from the documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
You can't avoid using groups when using backreferences, but you can put a new group around the whole pattern:
>>> p = re.compile(r"((\b\w+)\s+\2)")
>>> p.findall("I am in the the car.")
[('the the', 'the')]
The outer group is group 1, so the backreference should be pointing to group 2. You now have two groups, so there are two results per entry. Using a named group might make this more readable:
>>> p = re.compile(r"((?P<word>\b\w+)\s+(?P=word))")
You can filter that back to just the outer group result:
>>> [m[0] for m in p.findall("I am in the the car.")]
['the the']
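Another option, sketched here on the assumption that only the full matched text is needed, is to keep the original single-group pattern and use finditer(), which yields match objects instead of group contents. The name `doubled` is illustrative:

```python
import re

# Original pattern from the question, with a single capturing group.
p = re.compile(r"(\b\w+)\s+\1")

# finditer() yields match objects, so group(0) gives the whole match
# regardless of how many capturing groups the pattern defines.
doubled = [m.group(0) for m in p.finditer("I am in the the car.")]
print(doubled)  # ['the the']
```

This avoids renumbering the backreference, at the cost of a slightly longer expression.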
I am trying to get the element names and digits as regex groups. Even for the simplest case, as shown, I can't:
>>> import re
>>> t = "Fe35C65"
>>> m = re.match("(\D*\d+\D*\d+)", t)
>>> print(m.group(1))
Fe35C65
>>> print(m.group(0))
Fe35C65
>>> print(m)
<_sre.SRE_Match object; span=(0, 7), match='Fe35C65'>
>>>
What I am looking for is output like
Name[0] = "Fe" Name[1]="C"
Num[0] = 35, Num[1] = 65
Here there are 2 sets, but the number should not be limited. The underlying problem is extracting the data.
The problem is that re.match only returns 1 match and the number of capturing groups is fixed.
To match multiple occurrences of your pattern you may use re.findall with the r'(\D*)(\d+)' pattern, which matches and captures 0+ non-digit symbols into Group 1 and then 1+ digits into Group 2:
re.findall(r'(\D*)(\d+)', t)
Since re.findall returns captured texts only, you will get a list of 2-element tuples.
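A short sketch of how those tuples could be split into the two lists the question asks for. The names `pairs`, `names`, and `nums` are illustrative, and converting the digits with int() is an assumption about the desired output type:

```python
import re

t = "Fe35C65"

# Each tuple holds (non-digits, digits) for one occurrence of the pattern.
pairs = re.findall(r'(\D*)(\d+)', t)

# Split the pairs into separate lists; int() is an assumed conversion.
names = [name for name, _ in pairs]
nums = [int(num) for _, num in pairs]

print(pairs)  # [('Fe', '35'), ('C', '65')]
print(names)  # ['Fe', 'C']
print(nums)   # [35, 65]
```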
Alternative solution with the PyPI regex module
You may both validate a string and easily capture all occurrences of the multiple pairs of groups with the PyPI regex module like this:
>>> import regex
>>> t = "Fe35C65"
>>> pat = r"(?:(\D*)(\d+))+"
>>> m = regex.fullmatch(pat, t)
>>> if m:
...     print(list(zip(m.captures(1), m.captures(2))))
...
[('Fe', '35'), ('C', '65')]
The point here is:
(?:(\D*)(\d+))+ matches 1+ occurrences of (Group 1) 0+ non-digits and (Group 2) 1+ digits (extraction)
regex.fullmatch requires the entire string to match the pattern (validation)
The captures are stored in a group capture stack and can be accessed with .captures(n).
If there can be many you can use this:
import re

x = "Fe35C65"
m = re.compile(r"(\D+)(\d+)")
# finditer yields one match object per (name, digits) pair
for i in m.finditer(x):
    print(i.groups())
Output:
('Fe', '35')
('C', '65')
Using this pattern:
(?<=\(\\\\).*(?=\))
and this subject string: '(\\Drafts) "/" "&g0l6P3ux-"'
I was expecting to match Drafts
However, it is not working. Can someone explain why?
I am using the re module in Python. The following is what I did:
>>> pattern = re.compile("(?<=\(\\\\).*?(?=\\))")
>>> pattern.pattern
'(?<=\\(\\\\).*?(?=\\))'
>>> two
'(\\Drafts) "/" "&g0l6P3ux-"'
>>> match = pattern.search(two)
>>> match
<_sre.SRE_Match object at 0x1096e45e0>
>>> match.groups()
()
>>> match.group(0)
'Drafts'
>>>
My question is: why does groups() return nothing while group(0) returns the right answer?
match.groups() is empty because your pattern does not define any capturing groups. match.group(0) is the complete match, while match.group(1) would be the first capturing group if there was one.
To improve readability you should express regex patterns as raw strings. Yours can be written as
r"(?<=\(\\).*?(?=\))"
To break it down, there is a lookbehind for literal (\, then .*? and finally a lookahead for literal ).
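The following sketch checks that the raw-string form is the same pattern as the escaped one, and shows one way to get a non-empty groups() result. The capturing variant and the names `escaped`, `raw`, and `captured` are illustrative additions, not part of the original question:

```python
import re

two = '(\\Drafts) "/" "&g0l6P3ux-"'

# The escaped string from the question and the raw-string rewrite
# describe exactly the same pattern.
escaped = "(?<=\\(\\\\).*?(?=\\))"
raw = r"(?<=\(\\).*?(?=\))"
print(escaped == raw)  # True

m = re.search(raw, two)
print(m.group(0))  # Drafts
print(m.groups())  # ()

# Illustrative alternative: a capturing group instead of lookarounds.
captured = re.search(r"\(\\(.*?)\)", two)
print(captured.groups())  # ('Drafts',)
```

Lookarounds keep the delimiters out of group 0, while the capturing variant consumes them but exposes the inner text as group 1.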
I came across a regular expression today, but it was poorly and only briefly explained. What is the purpose of (?:...) in a Python regex, and where and when is it used?
I have tried this but it doesn't seem to be working. Why is that?
word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
expressoin = re.findall(r'(?:a-z\+a-z)', word);
From the re module documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever
regular expression is inside the parentheses, but the substring
matched by the group cannot be retrieved after performing a match or
referenced later in the pattern.
Basically, it's the same thing as (...) but without storing a captured string in a group.
Demo:
>>> import re
>>> re.search('(?:foo)(bar)', 'foobar').groups()
('bar',)
Only one group is returned, containing bar. The (?:foo) group was not.
Use this whenever you need to group metacharacters that would otherwise apply to a larger section of the expression, such as | alternate groups:
monty's (?:spam|ham|eggs)
You don't need to capture the group but do need to limit the scope of the | meta characters.
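A small sketch of how the three forms behave with re.findall(). The sample text and the variable names are made up for illustration:

```python
import re

# Hypothetical sample text, chosen for illustration.
text = "monty's spam, monty's eggs"

# Non-capturing group: | is scoped, findall() returns whole matches.
scoped = re.findall(r"monty's (?:spam|ham|eggs)", text)

# Capturing group: | is scoped, but findall() returns only the group.
captured = re.findall(r"monty's (spam|ham|eggs)", text)

# No group: | splits the whole pattern into three alternatives.
unscoped = re.findall(r"monty's spam|ham|eggs", text)

print(scoped)    # ["monty's spam", "monty's eggs"]
print(captured)  # ['spam', 'eggs']
print(unscoped)  # ["monty's spam", 'eggs']
```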
As for your sample attempt: when using re.findall() you often do want to capture output. You most likely are looking for:
re.findall(r'([a-z]\+[a-z])', word)
where re.findall() will return a list of tuples of all captured groups; if there is only one captured group, it's a list of strings containing just the one group per match.
Demo:
>>> word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
>>> re.findall(r'([a-z]\+[a-z])', word)
['x+y']
?: makes a group non-capturing.
For example, with the regex (\d+) the match is stored in group 1 (\1).
But if you use (?:\d+), nothing is stored in group 1.
It is used for a non-capturing group:
>>> matched = re.search('(?:a)(b)', 'ab') # using non-capturing group
>>> matched.group(1)
'b'
>>> matched = re.search('(a)(b)', 'ab') # using capturing group
>>> matched.group(1)
'a'
>>> matched.group(2)
'b'
I knew that [] denotes a set of allowable characters -
>>> p = r'^[ab]$'
>>>
>>> re.search(p, '')
>>> re.search(p, 'a')
<_sre.SRE_Match object at 0x1004823d8>
>>> re.search(p, 'b')
<_sre.SRE_Match object at 0x100482370>
>>> re.search(p, 'ab')
>>> re.search(p, 'ba')
But ... today I came across an expression with vertical bars within parentheses to define mutually exclusive patterns -
>>> q = r'^(a|b)$'
>>>
>>> re.search(q, '')
>>> re.search(q, 'a')
<_sre.SRE_Match object at 0x100498dc8>
>>> re.search(q, 'b')
<_sre.SRE_Match object at 0x100498e40>
>>> re.search(q, 'ab')
>>> re.search(q, 'ba')
This seems to mimic the same functionality as above, or am I missing something?
PS: In Python, parentheses themselves are used to define logical groups of matched text. If I use the second technique, then how do I use parentheses for both jobs?
In this case it is the same.
However, the alternation is not just limited to a single character. For instance,
^(hello|world)$
will match "hello" or "world" (and only these two inputs) while
^[helloworld]$
would just match a single character ("h" or "w" or "d" or whatnot).
Happy coding.
[ab] matches one character (a or b) and doesn't capture a group. (a|b) matches a or b and captures it. In this case there is no big difference, but in more complex cases [] can only contain characters and character classes, while (|) can contain arbitrarily complex regexes on either side of the pipe.
In the example you gave they are interchangeable. There are some differences worth noting:
Inside a character class you don't have to escape anything except a backslash, a dash, the closing square bracket, or the caret ^ (and the caret only if it is the first character).
Parentheses capture matches so you can refer to them later. Character class matches don't do that.
You can match multi-character strings in parentheses but not in character classes.
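The differences listed above can be sketched as follows. The variable names are illustrative, and the non-capturing form (?:a|b) is shown as one way to get alternation without creating a group:

```python
import re

# Same single-character behaviour, different capturing behaviour.
char_class = re.fullmatch(r'[ab]', 'a')
alternation = re.fullmatch(r'(a|b)', 'a')
grouped_only = re.fullmatch(r'(?:a|b)', 'a')

print(char_class.groups())    # ()
print(alternation.groups())   # ('a',)
print(grouped_only.groups())  # ()

# Alternation can match whole words; a character class matches one character.
word_alt = re.fullmatch(r'(?:hello|world)', 'world')
word_class = re.fullmatch(r'[helloworld]', 'world')

print(word_alt is not None)  # True
print(word_class is None)    # True
```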