I am experimenting with regular expressions in Python. I know that ^ is used to match at the beginning of the search string. I have written a match pattern that contains more than one ^, but I am not sure how re will try to match it against the search string:
re.match("^def/^def", "def/def")
I expected re to raise an error about an invalid regular expression, but it doesn't raise any error and simply returns no match.
So, my question is: is "^def/^def" or "$def/$def" a valid regular expression?
You do not have an invalid regular expression; ^ has legal uses in the middle of a pattern, for example when you use the re.M flag:
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline).
It is also possible to create patterns with optional groups, where a later ^ would still match if all of the preceding pattern matched the empty string. Using the ^ in places it can't match is not something the parser checks for and no error will be raised.
Your specific pattern will never match anything, because the ^ in the middle is unconditional and the / immediately before it can never match the newline character that ^ would require, even with the multiline flag enabled.
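The points above can be checked with a short sketch. The sample strings "abc\ndef" and the pattern (?:abc)?^def are illustrative assumptions, not taken from the question:

```python
import re

# The pattern compiles without error; the parser does not check
# whether a ^ is in a place where it can ever match.
compiled = re.compile("^def/^def")

# The second ^ follows "/", which is never the start of the string or of a line.
never = compiled.match("def/def")
never_multiline = re.match("^def/^def", "def/def", re.M)

# With re.M, a ^ in the middle of a pattern matches right after a newline.
after_newline = re.search(r"^abc\n^def", "abc\ndef", re.M)
without_flag = re.search(r"^abc\n^def", "abc\ndef")

# An optional group that matches the empty string leaves a later ^ at position 0.
optional_prefix = re.match(r"(?:abc)?^def", "def")

print(never, never_multiline)                   # None None
print(after_newline is not None, without_flag)  # True None
print(optional_prefix.group())                  # def
```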
In my Python script I need an expression like
"\[.*[ERROR].*\n.*\n.*\n.*/\n.*is for multiple time/[\]]{2}"
Please let me know how to match "\n." multiple times. I'm stuck at this point.
There is the multiline flag available, which lets ^ and $ match at the start and end of every line rather than only at the ends of the string.
https://docs.python.org/2/library/re.html#re.MULTILINE
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
You also have access to DOTALL, which makes . match newlines as well:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Depending on your match, those two flags let you choose how newlines are handled. In your case, you probably want to adjust your pattern like this:
import re
text = '\n[ [ERROR]\n\nsome text\nis for multiple time]'
# raw string, so that \[ and \] reach the regex engine unchanged
re.findall(r"\[.*\[ERROR\].*is for multiple time\]", text, re.DOTALL)
# result: ['[ [ERROR]\n\nsome text\nis for multiple time]']
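To make the difference between the two flags concrete, here is a minimal sketch. The two-line sample text is an assumption chosen for illustration:

```python
import re

text = "first\nsecond"

# re.MULTILINE only changes where ^ and $ can match.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# re.DOTALL only changes what . can match.
dot_default = re.findall(r"first.second", text)
dot_dotall = re.findall(r"first.second", text, re.DOTALL)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(dot_default)       # []
print(dot_dotall)        # ['first\nsecond']
```

The flags are independent and can be combined with | when a pattern needs both behaviours.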
I have the following regular expression, which is meant to find the "IF" keyword (case insensitive) in a string. Some constraints are imposed:
It should be preceded by a whitespace or a ) character (from a previous expression)
It should be followed by whitespace or ( character
The expression below satisfies these constraints. However, it does not find the keyword when it's located at the start of a string (if(foo, 1, 2) for instance).
Using something like ^|(?<=[\s\)])(?i)if(?=[\s\(]) does not seem to work. I tried (?:^|[\s\)]) but that seems to also capture the space in front of the keyword.
This is what I have so far:
(?<=[\s\)])(?i)if(?=[\s\(])
You may use an alternation group with two zero-width assertions:
(?i)(?:^|(?<=[\s)]))if(?=[\s(])
    ^^^^^^^^^^^^^^^^
Here, (?:^|(?<=[\s)])) matches:
^ - start of string
| - or
(?<=[\s)]) - a location that is immediately preceded with a whitespace or ) character.
Note that in Python versions before 3.11 the (?i) inline case-insensitive modifier affects the whole pattern regardless of where it is located in it. Placing it anywhere but the start has been deprecated since Python 3.6 and is an error from Python 3.11 on, so move it to the start of the pattern.
Also, there is no need to escape ( and ) inside character classes, [...] constructs, as they are treated as literal parentheses inside them.
The problem is that | is applied at top level, so it is an alternation between:
^ and (?<=[\s\)])(?i)if(?=[\s\(]).
Just add a non-capturing group around ^ and (?<=[\s\)]):
(?:^|(?<=[\s\)]))(?i)if(?=[\s\(])
You can solve the problem (for this particular case that only involves single characters) using a double negation:
(?<![^\s)])
(not preceded by a character that is neither a whitespace nor a closing parenthesis). This condition includes the start of the string too.
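As a rough check that the alternation and the double negation behave the same way, the sketch below runs both against a few sample strings. The helper find_if and the sample inputs are hypothetical, and (?i) is placed at the start of each pattern so that it also compiles on recent Python versions:

```python
import re

# Alternation of two zero-width assertions: start of string, or after whitespace / ")".
alternation = re.compile(r"(?i)(?:^|(?<=[\s)]))if(?=[\s(])")
# Double negation: not preceded by anything other than whitespace / ")".
double_negation = re.compile(r"(?i)(?<![^\s)])if(?=[\s(])")

def find_if(pattern, text):
    """Return the index where the keyword starts, or None (hypothetical helper)."""
    match = pattern.search(text)
    return match.start() if match else None

samples = ["if(foo, 1, 2)", "x) IF (y)", "a)if(b)", "elif(x)", "diff (x)"]
results_alternation = [find_if(alternation, s) for s in samples]
results_negation = [find_if(double_negation, s) for s in samples]

print(results_alternation)  # [0, 3, 2, None, None]
print(results_negation)     # [0, 3, 2, None, None]
```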
I do not understand why the following code snippet returns false. I understand that special characters must be escaped, but re.escape() already does that.
import re
string = re.escape('$')
pattern = re.compile(string)
print(bool(pattern.match(string)))
You are escaping the wrong one. The string to be searched does not need to be modified. But strings you include in the pattern to be matched literally do.
import re
string = '$'
pattern = re.compile(re.escape(string))
print(bool(pattern.match(string)))
Here, pattern \$ (match literal $) is matched against the string "$", and succeeds.
In your example, the pattern \$ (match literal $) is matched against the string "\$" (r"\$" or "\\$" in Python), and fails because match only tries the pattern at the start of the string, where it finds a backslash instead of $.
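A small sketch of both cases side by side. The variable names are illustrative, and the search call is added only to show where the literal $ sits in the escaped string:

```python
import re

literal = "$"
escaped = re.escape(literal)   # the two-character string \$
pattern = re.compile(escaped)

# Matching the escaped pattern against the original, unmodified string succeeds.
against_literal = pattern.match(literal)

# Matching it against the escaped string fails: position 0 holds a backslash.
against_escaped = pattern.match(escaped)

# search scans forward and finds the $ at index 1 of the escaped string.
found_later = pattern.search(escaped)

print(escaped == "\\$")          # True
print(against_literal.group())   # $
print(against_escaped)           # None
print(found_later.start())       # 1
```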
I am using the Python re module.
I can use the regex r'\bA\b' (a raw string) to differentiate between 'A' and 'AA': it will find a match in the string 'A' and no matches in the string 'AA'.
I would like to achieve the same thing with a caret ^ instead of the A: I want a regex which differentiates between '^' and '^^'.
The problem I have is that the regex r'\b\^\b' does not find a match in '^'.
Any ideas?
You need to use lookaround for this:
(?<!\^)\^(?!\^)
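\b is a word boundary, a place between a word character and a non-word character, so your original pattern is quite non-specific: it doesn't say anything about A specifically, and A_ would also not match, given that _ is a word character. Since ^ is not a word character, there is no word boundary next to a lone ^, which is why \b\^\b finds nothing in '^'.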
Here, we assert that there needs to be a place where the preceding character is not a caret, then a caret, then a place where the following character is not a caret (which boils down to "the caret must not be in caret company").
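The lookaround version can be tried out as follows. This is a minimal sketch, and the sample strings other than '^' and '^^' are assumptions:

```python
import re

lone_caret = re.compile(r"(?<!\^)\^(?!\^)")

single = lone_caret.search("^")
double = lone_caret.search("^^")
mixed = lone_caret.findall("a ^ b ^^ c ^")

# \b needs a word character on one side, so it fails on a bare caret
# but succeeds when the caret sits between two word characters.
boundary_bare = re.search(r"\b\^\b", "^")
boundary_between_words = re.search(r"\b\^\b", "a^b")

print(single is not None)                  # True
print(double)                              # None
print(mixed)                               # ['^', '^']
print(boundary_bare)                       # None
print(boundary_between_words is not None)  # True
```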
What does this Python regex match?
.*?[^\\]\n
I'm confused about why the . is followed by both * and ?.
* means "match the previous element as many times as possible (zero or more times)".
*? means "match the previous element as few times as possible (zero or more times)".
The other answers already address this, but they don't bring up how it changes the regex. If the re.DOTALL flag is provided it makes a huge difference, because . then matches line break characters. So .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
If the re.DOTALL flag is not provided, the difference is more subtle: [^\\] will match everything other than a backslash, including line break characters. Consider the following example:
>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']
So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.
This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. However the .* will match foo, the [^\\] will match the \n to end the first line, and the next \n will match because the second line is blank.
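The re.DOTALL case mentioned earlier can be sketched in the same way. The three-line sample string, whose second line ends with a backslash, is an assumption for illustration:

```python
import re

# Three lines; the second one ends with a backslash.
s = "foo\nbar\\\nbaz\n"

# Greedy with DOTALL: runs to the last newline not preceded by a backslash.
greedy_dotall = re.findall(r".*[^\\]\n", s, re.DOTALL)

# Lazy with DOTALL: stops at the first such newline, then starts over.
lazy_dotall = re.findall(r".*?[^\\]\n", s, re.DOTALL)

print(greedy_dotall)  # ['foo\nbar\\\nbaz\n']
print(lazy_dotall)    # ['foo\n', 'bar\\\nbaz\n']
```

In the lazy case the second match spans two lines, because the newline after the backslash does not satisfy [^\\]\n and . is allowed to step over it.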
. indicates a wild card. It can match anything except a \n, unless the appropriate flag is used.
* indicates that you can have 0 or more of the thing preceding it.
? indicates that the preceding quantifier is lazy: it matches as few characters as possible while still letting the rest of the pattern match.
Opening the Python re module documentation, and searching for *?, we find:
*?, +?, ??:
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
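The example from the documentation can be reproduced directly. A minimal sketch, with illustrative variable names:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* takes as much as it can, so the match ends at the last ">".
greedy = re.match(r"<.*>", html).group()

# Non-greedy: .*? takes as little as it can, so the match ends at the first ">".
lazy = re.match(r"<.*?>", html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```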