This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
While reading the docs, I found out that the whole difference between re.match() and re.search() is that re.match() starts checking only from the beginning of the string.
>>> import re
>>> a = 'abcde'
>>> re.match(r'b', a)
>>> re.search(r'b', a)
<_sre.SRE_Match object at 0xffe25c98>
>>> re.search(r'^b', a)
>>>
Is there anything I am misunderstanding, or is there no difference at all between re.search('^' + pattern) and re.match(pattern)?
Is it a good practice to only use re.search()?
Take a look at the "search() vs. match()" section of Python's re documentation, which spells out the other difference:
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>
The first difference (for future readers) being:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>
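To make the relationship concrete, here is a minimal sketch of how re.match() could be emulated with re.search(). The helper name match_via_search is hypothetical, and the sketch assumes \A (start of string, unaffected by re.MULTILINE) as the anchor instead of ^. It also assumes the pattern carries no leading inline flags such as (?i), since wrapping would move them away from the start of the expression.

```python
import re

def match_via_search(pattern, string, flags=0):
    """Emulate re.match() with re.search() (hypothetical helper).

    The pattern is wrapped in a non-capturing group so that a top-level
    '|' inside it stays within the scope of the anchor.
    """
    return re.search(r"\A(?:" + pattern + r")", string, flags)

text = "A\nB\nX"

# '^' becomes line-aware under re.MULTILINE, so search() finds the last line.
caret_hit = re.search(r"^X", text, re.MULTILINE)

# '\A' only ever matches at the very start of the string, like re.match().
anchored = match_via_search(r"X", text, re.MULTILINE)
plain_match = re.match(r"X", text, re.MULTILINE)

print(caret_hit is not None, anchored, plain_match)  # True None None
```

So re.search('^' + pattern) and re.match(pattern) agree only as long as re.MULTILINE is off; with \A the two stay aligned in this sketch regardless of that flag.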
If you look at this from a code golfing perspective, I'd say there is some use in keeping the two functions separate.
If you're matching from the beginning of the string, re.match would be preferable to re.search, because its name is one character shorter, which saves a byte. Furthermore, with re.search you also have to add the ^ anchor to force the match to start at the beginning. re.match implies that anchor, which saves another byte.
Anyone know why these two regexes give different results when trying to match either '//' or '$'? (Python 3.6.4)
(a)(//|$) : Matches both 'a' and 'a//'
(a)(//)|($) : Matches 'a//' but not 'a'
>>> at = re.compile('(a)(//|$)')
>>> m = at.match('a')
>>> m
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> m = at.match('a//')
>>> m
<_sre.SRE_Match object; span=(0, 3), match='a//'>
>>>
vs
>>> at = re.compile('(a)(//)|($)')
>>> m = at.match('a//')
>>> m
<_sre.SRE_Match object; span=(0, 3), match='a//'>
>>> m = at.match('a')
>>> m
>>> type(m)
<class 'NoneType'>
>>>
The regex engine will group the expressions on each side of a pipe before evaluating.
In the first case,
(a)(//|$)
matches a string in which an a is followed by either // or $ (i.e., the end of the string).
Hence, the first alternative is // and the second alternative is $; both must follow an a.
In this expression, the capturing groups are:
Group 1: a
Group 2: either // or $
In the second case,
(a)(//)|($)
matches a string that is either a// or $.
Hence, the first alternative is a// and the second alternative is $.
In this expression, the capturing groups are:
Either
Group 1: a
Group 2: //
or
Group 3: $
In fact, the grouping doesn't matter in the second example: a//|$ will match the same strings, since the regex engine evaluates it as (a//)|$ (the parentheses here are only symbolic; they do not represent capture group syntax).
Try it out in a regex tester. It'll tell you what the alternatives are for each expression
| has low precedence, so (a)(//)|($) means ((a)(//))|($); it will therefore match either ((a)(//)) or ($). To get the same result as the first regex, use (a)((//)|($)), which is the first regex with extra groups added. The first regex is cleaner and should be preferred unless you need the additional groups.
See here for more details on precedence - https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_08
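The precedence rule can be checked directly by inspecting the groups each pattern captures. This is a small sketch; the variable names grouped and ungrouped are just labels chosen for the illustration.

```python
import re

grouped = re.compile(r"(a)(//|$)")      # 'a' followed by '//' or end of string
ungrouped = re.compile(r"(a)(//)|($)")  # parsed as ((a)(//)) | ($)

print(grouped.match("a").groups())      # ('a', '')
print(grouped.match("a//").groups())    # ('a', '//')
print(ungrouped.match("a//").groups())  # ('a', '//', None)

# At position 0 of 'a', neither 'a//' nor an end-of-string is present.
print(ungrouped.match("a"))             # None

# search() may start later, so the bare '$' alternative matches at the end.
print(ungrouped.search("a").span())     # (1, 1)
```

The last line shows that the second alternative really is a bare $: it matches an empty string at the end of the input and has nothing to do with the a.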
In Python:
Given the string 've, I can match the start of the string with a caret:
>>> import re
>>> s = u"'ve"
>>> re.match(u"^[\'][a-z]", s)
<_sre.SRE_Match object at 0x1109ee030>
So it matches even though the length of the substring after the single quote is > 1.
But for the dollar (matching end of string):
>>> import re
>>> s = u"'ve"
>>> re.match(u"[a-z]$", s)
>>>
In Perl, it seems like the end of the string can be matched with:
$s =~ /[\p{IsAlnum}]$/
Is $s =~ /[\p{IsAlnum}]$/ the same as re.match(u"[a-z]$", s) ?
Why do the caret and the dollar behave differently? And do they behave differently in Python and Perl?
re.match is implicitly anchored at the start of the string. Quoting the documentation:
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance.
Try re.search instead.
>>> import re
>>> s = u"'ve"
>>> re.search(u"[a-z]$", s)
<_sre.SRE_Match object at 0x7fea24df3780>
>>>
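As a rough illustration of why the caret pattern worked with re.match while the dollar pattern did not, the sketch below compares the calls side by side. The variable names are made up for the example.

```python
import re

s = "'ve"

# match() tries the pattern at index 0 only; the quote is there, so this works.
start = re.match(r"'[a-z]", s)

# At index 0 the character is a quote, not a letter, so match() gives up.
end_with_match = re.match(r"[a-z]$", s)

# search() scans forward and finds the final letter just before the end.
end_with_search = re.search(r"[a-z]$", s)

# match() can still be used if the pattern consumes the leading text itself.
whole = re.match(r".*[a-z]$", s)

print(start.group())           # 'v
print(end_with_match)          # None
print(end_with_search.span())  # (2, 3)
print(whole.group())           # 've
```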
Is there any easy way to test whether a regex matches an entire string in Python? I thought that putting $ at the end would do this, but it turns out that $ doesn't work in the case of trailing newlines.
For example, the following returns a match, even though that's not what I want.
re.match(r'\w+$', 'foo\n')
You can use \Z:
\Z
Matches only at the end of the string.
In [5]: re.match(r'\w+\Z', 'foo\n')
In [6]: re.match(r'\w+\Z', 'foo')
Out[6]: <_sre.SRE_Match object; span=(0, 3), match='foo'>
To test whether you matched the entire string, just check if the matched string is as long as the entire string:
m = re.match(r".*", mystring)
start, stop = m.span()
if stop - start == len(mystring):
    print("The entire string matched")
Note: This is independent of the question (which you didn't ask) of how to match a trailing newline.
You can use a negative lookahead assertion to require that the $ is not followed by a trailing newline:
>>> re.match(r'\w+$(?!\n)', 'foo\n')
>>> re.match(r'\w+$(?!\n)', 'foo')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
re.MULTILINE is not relevant here; OP has it turned off and the regex is still matching. The problem is that $ always matches right before the trailing newline:
When [re.MULTILINE is] specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
I have experimentally verified that this works correctly with re.X enabled.
Based on @alexis's answer:
A method to check for a fullMatch could look like this:
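def fullMatch(matchObject, fullString):
    # No match at all means the string was not fully matched.
    if matchObject is None:
        return False
    start, stop = matchObject.span()
    # The match covers the whole string only if the lengths are equal.
    return stop - start == len(fullString)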
Where the fullString is the String on which you apply the regex and the matchObject is the result of matchObject = re.match(yourRegex, fullString)
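The approaches above can be compared in one place. The sketch below also uses re.fullmatch, which none of the answers mention; it assumes Python 3.4 or later, where that function is available. The helper name is_full_match is hypothetical.

```python
import re

def is_full_match(pattern, text):
    """Return True only if pattern matches all of text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# '$' also matches just before a trailing newline, so this still succeeds.
dollar = re.match(r"\w+$", "foo\n")

# '\Z' matches only at the very end of the string.
strict_end = re.match(r"\w+\Z", "foo\n")

# A negative lookahead rejects the position before the trailing newline.
lookahead = re.match(r"\w+$(?!\n)", "foo\n")

print(dollar is not None)             # True
print(strict_end, lookahead)          # None None
print(is_full_match(r"\w+", "foo\n")) # False
print(is_full_match(r"\w+", "foo"))   # True
```

With re.fullmatch the pattern needs no end anchor at all, which may be the simplest option when the Python version allows it.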
This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile(r'\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print(result)
None
Alternatively, this works:
>>> char_regex = re.compile(r'\w')
>>> result = char_regex.match(string)
>>> print(result)
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what the documentation for re.match() says: "If zero or more characters at the beginning of string match the regular expression pattern ..."
In your case the string doesn't have a digit (\d) at the beginning, but it does have a word character (\w), the t, at the beginning.
If you want to check for a digit anywhere in your string using the same mechanism, prefix your regex with .*:
digit_regex = re.compile(r'.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
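The options mentioned in these answers can be tried side by side. This is only a sketch reusing the question's string; the name prefixed is introduced here for illustration.

```python
import re

string = 'this is a string with a 4 digit in it'
digit_regex = re.compile(r'\d')

# match() only looks at the start of the string, where there is no digit.
print(digit_regex.match(string))           # None

# search() scans the whole string and returns the first hit.
print(digit_regex.search(string).group())  # 4

# findall() returns every non-overlapping hit as a list of strings.
print(digit_regex.findall(string))         # ['4']

# Keeping match() works if the pattern consumes the leading text itself.
prefixed = re.match(r'.*(\d)', string)
print(prefixed.groups())                   # ('4',)
```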
I am trying to match different expressions for addresses:
Example: '398 W. Broadway'
I would like to match W., E. (East), Pl. (Place), and so on.
It is very simple using this regex
(W.|West) for example.
Yet Python's re module doesn't match anything when I try that:
>>> a
'398 W. Broadway'
>>> x = re.match('(W.|West)', a)
>>> x
>>> x == None
True
>>>
re.match matches at the beginning of the input string.
To match anywhere, use re.search instead.
>>> import re
>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x0000000001E18578>
>>> re.match('a', 'bac')
>>> re.search('a', 'bac')
<_sre.SRE_Match object at 0x0000000002654370>
See search() vs. match():
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string (this is what Perl does by default).
.match() constrains the search to begin at the first character of the string. Use .search() instead. Note too that . matches any character (except a newline). If you want to match a literal period, escape it (\. instead of plain .).
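Putting both pieces of advice together (use search, escape the dot) might look like the sketch below. The \b word boundary is an extra assumption added for the example so that a W in the middle of a word is not picked up; the second address is invented for the illustration.

```python
import re

# Escaped dot: matches a literal period only.
direction = re.compile(r"\b(W\.|West)")
# Unescaped dot, as in the question: '.' matches any character.
unescaped = re.compile(r"(W.|West)")

address = "398 W. Broadway"

print(direction.match(address))                   # None
print(direction.search(address).group(1))         # W.

# With the unescaped dot, 'W.' happily matches 'We' and 'West' is never tried.
print(unescaped.search("398 West Ave").group(1))  # We
print(direction.search("398 West Ave").group(1))  # West
```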