This question already has answers here:
Checking whole string with a regex
(5 answers)
Closed last year.
Is there any easy way to test whether a regex matches an entire string in Python? I thought that putting $ at the end would do this, but it turns out that $ doesn't work in the case of trailing newlines.
For example, the following returns a match, even though that's not what I want.
re.match(r'\w+$', 'foo\n')
You can use \Z:
\Z
Matches only at the end of the string.
In [5]: re.match(r'\w+\Z', 'foo\n')
In [6]: re.match(r'\w+\Z', 'foo')
Out[6]: <_sre.SRE_Match object; span=(0, 3), match='foo'>
To test whether you matched the entire string, just check if the matched string is as long as the entire string:
m = re.match(r".*", mystring)
start, stop = m.span()
if stop-start == len(mystring):
print("The entire string matched")
Note: This is independent of the question (which you didn't ask) of how to match a trailing newline.
You can use a negative lookahead assertion to require that the $ is not followed by a trailing newline:
>>> re.match(r'\w+$(?!\n)', 'foo\n')
>>> re.match(r'\w+$(?!\n)', 'foo')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
re.MULTILINE is not relevant here; OP has it turned off and the regex is still matching. The problem is that $ always matches right before the trailing newline:
When [re.MULTILINE is] specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
I have experimentally verified that this works correctly with re.X enabled.
Based on #alexis answer:
A method to check for a fullMatch could look like this:
def fullMatch(matchObject, fullString):
if matchObject is None:
return False
start, stop = matchObject.span()
return stop-start == len(fullString):
Where the fullString is the String on which you apply the regex and the matchObject is the result of matchObject = re.match(yourRegex, fullString)
Related
I have a string like this
"yJdz:jkj8h:jkhd::hjkjh"
I want to split it using colon as a separator, but not a double colon. Desired result:
("yJdz", "jkj8h", "jkhd::hjkjh")
I'm trying with:
re.split(":{1}", "yJdz:jkj8h:jkhd::hjkjh")
but I got a wrong result.
In the meanwhile I'm escaping "::", with string.replace("::", "$$")
You could split on (?<!:):(?!:). This uses two negative lookarounds (a lookbehind and a lookahead) which assert that a valid match only has one colon, without a colon before or after it.
To explain the pattern:
(?<!:) # assert that the previous character is not a colon
: # match a literal : character
(?!:) # assert that the next character is not a colon
Both lookarounds are needed, because if there was only the lookbehind, then the regular expression engine would match the first colon in :: (because the previous character isn't a colon), and if there was only the lookahead, the second colon would match (because the next character isn't a colon).
You can do this with lookahead and lookbehind, if you want:
>>> s = "yJdz:jkj8h:jkhd::hjkjh"
>>> l = re.split("(?<!:):(?!:)", s)
>>> print l
['yJdz', 'jkj8h', 'jkhd::hjkjh']
This regex essentially says "match a : that is not followed by a : or preceded by a :"
This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
While reading the docs, I found out that the whole difference between re.match() and re.search() is that re.match() starts checking only from the beginning of the string.
>>> import re
>>> a = 'abcde'
>>> re.match(r'b', a)
>>> re.search(r'b', a)
<_sre.SRE_Match object at 0xffe25c98>
>>> re.search(r'^b', a)
>>>
Is there anything I am misunderstanding, or is there no difference at all between re.search('^' + pattern) and re.match(pattern)?
Is it a good practice to only use re.search()?
You should take a look at Python's re.search() vs. re.match() document which clearly mentions about the other difference which is:
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>
The first difference (for future readers) being:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>
If you look at this from a code golfing perspective, I'd say there is some use in keeping the two functions separate.
If you're looking from the beginning of the string, re.match, would be preferable to re.search, because the former has one character less in its name, thus saving a byte. Furthermore, with re.search, you also have to add the start-of-line anchor ^ to signify matching from the start. You don't need to specify this with re.match because it is implied, further saving another byte.
I am writing a Python script to find a tag name in a string like this:
string='Tag Name =LIC100 State =TRUE'
If a use a expression like this
re.search('Name(.*)State',string)
I get " =LIC100". I would like to get just LIC100.
Any suggestions on how to set up the pattern to eliminate the whitespace and the equal signal?
That is because you get 0+ chars other than line break chars from Name up to the last State. You may restrict the pattern in Group 1 to just non-whitespaces:
import re
string='Tag Name =LIC100 State =TRUE'
m = re.search(r'Name\s*=(\S*)',string)
if m:
print(m.group(1))
See the Python demo
Pattern details:
Name - a literal char sequence
\s* - 0+ whitespaces
= - a literal =
(\S*) - Group 1 capturing 0+ chars other than whitespace (or \S+ can be used to match 1 or more chars other than whitespace).
The easiest solution would probably just be to strip it out after the fact, like so:
s = " =LIC100 "
s = s.strip('= ')
print(s)
#LIC100
If you insist on doing it within the regex, you can try something like:
reg = r'Name[ =]+([A-Za-z0-9]+)\s+State'
Your current regex is failing because (.*) captures all characters until the occurance of State. Instead of capturing everything, you can use a positive lookbehind to describe what preceeds, but is not included in, the content you actually want to capture. In this case, "Name =" preceeds the match, so we can stick it in the lookbehind assertion as (?<=Name =), then proceed to capture everything until the next whitespace:
>>> import re
>>> s = 'Tag Name =LIC100 State =TRUE'
>>> r = re.compile("(?<=Name =)\w*")
>>> print(r.search(s))
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
>>> print(r.search(s).group(0))
LIC100
Following the tips above, I manage to find a nice solution.
Actually, the string I am trying to process has some non-printable characters. It is like this
"Tag Name\x00=LIC100\x00\tState=TRUE"
Using the concept of lookahead and lookbehind I found the following solution:
import re
s = 'Tag Name\x00=LIC100\x00\tState=TRUE'
T=re.search(r'(?<=Name\x00=)(.*)(?=\x00\tState)',s)
print(T.group(0))
The nice thing about this is that the outcome does not have any non-printable character on it.
<_sre.SRE_Match object; span=(10, 16), match='LIC100'>
I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
print m.group(0)
Any ideas how to improve my regexp? I was trying with * and + sign, but I'm not sure how to finally create it.
I was searching for similar post, but couldnt find it :(
You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
See Python demo
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
^ ^
See another Python demo. Note the (...) capturing group added so that re.findall could only return what is captured, and not what is matched.
re.search finds only the first match. Perhaps you'd want re.findall, which returns list of strings, or re.finditer that returns iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more any characters besides $; this can potentially avoid more pathological backtracking behaviour.
Your regex is fine. re.search only finds the first match in a line. You are looking for re.findall, which finds all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in m = re.findall(r'$(.*?)$', line):
if m is not None:
print m.group(0)
This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what re.match() says If zero or more characters at the beginning of string match the regular expression pattern ...
In your case the string doesn't have any digit \d at the beginning. But for the \w it has t at the beginning at your string.
If you want to check for digit in your string using same mechanism, then add .* with your regex:
digit_regex = re.compile('.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.