I knew that [] denotes a set of allowable characters -
>>> p = r'^[ab]$'
>>>
>>> re.search(p, '')
>>> re.search(p, 'a')
<_sre.SRE_Match object at 0x1004823d8>
>>> re.search(p, 'b')
<_sre.SRE_Match object at 0x100482370>
>>> re.search(p, 'ab')
>>> re.search(p, 'ba')
But ... today I came across an expression with vertical bars within parenthesis to define mutually exclusive patterns -
>>> q = r'^(a|b)$'
>>>
>>> re.search(q, '')
>>> re.search(q, 'a')
<_sre.SRE_Match object at 0x100498dc8>
>>> re.search(q, 'b')
<_sre.SRE_Match object at 0x100498e40>
>>> re.search(q, 'ab')
>>> re.search(q, 'ba')
This seems to mimic the same functionality as above, or am I missing something?
PS: In Python parenthesis themselves are used to define logical groups of matched text. If I use the second technique, then how do I use parenthesis for both jobs?
In this case it is the same.
However, the alternation is not just limited to a single character. For instance,
^(hello|world)$
will match "hello" or "world" (and only these two inputs) while
^[helloworld]$
would just match a single character ("h" or "w" or "d" or whatnot).
Happy coding.
[ab] matches one character (a or b) and doesn't capture the group. (a|b) captures a or b, and matches it. In this case, no big difference, but in more complex cases [] can only contain characters and character classes, while (|) can contain arbitrarily complex regex's on either side of the pipe
In the example you gave they are interchangeable. There are some differences worth noting:
In the character class square brackets you don't have to escape anything but a dash or square brackets, or the caret ^ (but then only if it's the first character.)
Parentheses capture matches so you can refer to them later. Character class matches don't do that.
You can match multi-character strings in parentheses but not in character classes
Related
This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
While reading the docs, I found out that the whole difference between re.match() and re.search() is that re.match() starts checking only from the beginning of the string.
>>> import re
>>> a = 'abcde'
>>> re.match(r'b', a)
>>> re.search(r'b', a)
<_sre.SRE_Match object at 0xffe25c98>
>>> re.search(r'^b', a)
>>>
Is there anything I am misunderstanding, or is there no difference at all between re.search('^' + pattern) and re.match(pattern)?
Is it a good practice to only use re.search()?
You should take a look at Python's re.search() vs. re.match() document which clearly mentions about the other difference which is:
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>
The first difference (for future readers) being:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>
If you look at this from a code golfing perspective, I'd say there is some use in keeping the two functions separate.
If you're looking from the beginning of the string, re.match, would be preferable to re.search, because the former has one character less in its name, thus saving a byte. Furthermore, with re.search, you also have to add the start-of-line anchor ^ to signify matching from the start. You don't need to specify this with re.match because it is implied, further saving another byte.
Consider the text below:
foobar¬
nextline
The regex (.*?(?: *?\n)) matches foobar¬
where ¬ denotes a newline \n.
Why does the regex match it? shouldn't the non-capture group exclude it?
Tested on Regex101 for the python dialect.
“Non-capturing group” refers to the fact that matches within that group will not be available as separate groups in the resulting match object. For example:
>>> re.search('(foo)(bar)', 'foobarbaz').groups()
('foo', 'bar')
>>> re.search('(foo)(?:bar)', 'foobarbaz').groups()
('foo',)
However, everything that is part of an expression is matched and as such appears in the resulting match (Group 0 shows the whole match):
>>> re.search('(foo)(bar)', 'foobarbaz').group(0)
'foobar'
>>> re.search('(foo)(?:bar)', 'foobarbaz').group(0)
'foobar'
If you don’t want to match that part but still want to make sure it’s there, you can use a lookahead expression:
>>> re.search('(foo)(?=bar)', 'foobarbaz')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> re.search('(foo)(?=bar)', 'foobaz')
None
So in your case, you could use (.*?(?= *?\n)).
The \n is captured because the non-capturing group is inside the capturing group:
>>> s = 'foobar\nnextline'
>>> re.search(r'(.*?(?: *?\n))', s).groups()
('foobar\n',)
If you don't want that, place the non-capturing group outside of the capturing one:
>>> re.search(r'(.*?)(?: *?\n)', s).groups()
('foobar',)
I would like to get 2 captured groups for a pair of consecutive words. I use this regular expression:
r'\b(hello)\b(world)\b'
However, searching "hello world" with this regular expression yields no results:
regex = re.compile(r'\b(hello)\b(world)\b')
m = regex.match('hello world') # m evaluates to None.
You need to allow for space between the words:
>>> import re
>>> regex = re.compile(r'\b(hello)\s*\b(world)\b')
>>> regex.match('hello world')
<_sre.SRE_Match object at 0x7f6fcc249140>
>>>
Discussion
The regex \b(hello)\b(world)\b requires that the word hello end exactly where the word world begins but with a word break \b between them. That cannot happen. Adding space, \s, between them fixes this.
If you meant to allow punctuation or other separators between hello and world, then that possibility should be added to the regex.
This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what re.match() says If zero or more characters at the beginning of string match the regular expression pattern ...
In your case the string doesn't have any digit \d at the beginning. But for the \w it has t at the beginning at your string.
If you want to check for digit in your string using same mechanism, then add .* with your regex:
digit_regex = re.compile('.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
I am trying to match different expressions for addresses:
Example: '398 W. Broadway'
I would like to match W. or E. (east) or Pl. for place ...etc
It is very simple using this regex
(W.|West) for example.
Yet python re module doesn't match anything when I input that
>>> a
'398 W. Broadway'
>>> x = re.match('(W.|West)', a)
>>> x
>>> x == None
True
>>>
re.match matches at the beginning of the input string.
To match anywhere, use re.search instead.
>>> import re
>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x0000000001E18578>
>>> re.match('a', 'bac')
>>> re.search('a', 'bac')
<_sre.SRE_Match object at 0x0000000002654370>
See search() vs. match():
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string (this is what Perl does by default).
.match() constrains the search to begin at the first character of the string. Use .search() instead. Note too that . matches any character (except a newline). If you want to match a literal period, escape it (\. instead of plain .).