Strange behavior of capturing group in regular expression - python

Given the following simple regular expression, whose goal is to capture the text between quote characters:
regexp = '"?(.+)"?'
When the input is something like:
"text"
The capturing group(1) has the following:
text"
I expected group(1) to contain only text (without the quotes). Could somebody explain what's going on and why the regular expression is capturing the " symbol even when it's outside the capturing group #1? Another strange behavior that I don't understand is why the second quote character is captured but not the first one given that both of them are optional. In the end I fixed it by using the following regex, but I would like to understand what I was doing wrong:
regexp = '"?([^"]+)"?'

Quantifiers in regular expressions are greedy: they try to match as much text as possible. Because your last " is optional (you wrote "? in your regular expression), the .+ will match it.
Using [^"] is one acceptable solution. The drawback is that your string cannot contain " characters (which may or may not be desirable, depending on the case).
Another is to make " required:
regexp = '"(.+)"'
Another one is to make the + non-greedy, by using +?. However you also need to add anchors ^ and $ (or similar, depending on the context), otherwise it'll match only the first character (t in the case of "test"):
regexp = '^"?(.+?)"?$'
This regular expression allows " characters to be in the middle of the string, so that "t"e"s"t" will result in t"e"s"t being captured by the group.
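As a quick check of the variants discussed above, here is a minimal sketch that runs each pattern against the input "text" and collects group(1). The dictionary labels are only illustrative names, not anything from the question.

```python
import re

text = '"text"'

# Illustrative labels mapped to the patterns discussed in this answer.
patterns = {
    "greedy, optional quotes": r'"?(.+)"?',
    "negated class": r'"?([^"]+)"?',
    "required quotes": r'"(.+)"',
    "lazy, unanchored": r'"?(.+?)"?',
    "lazy, anchored": r'^"?(.+?)"?$',
}

# Collect what group(1) captures for each pattern.
results = {label: re.search(p, text).group(1) for label, p in patterns.items()}

for label, captured in results.items():
    print(f"{label:25} -> {captured}")
```

Only the first pattern keeps the trailing quote, and the unanchored lazy pattern stops after a single character, which is why the anchors are needed there.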

why the regular expression is capturing the " symbol even when it's outside the capturing group #1
The "?(.+)"? pattern contains a greedy dot-matching subpattern, and a . can match a ", too. The trailing "? is an optional subpattern. When a greedy subpattern (here .+) can itself match what the following optional subpattern would match (here a "), the greedy one consumes that text and the optional subpattern is left matching nothing.
The negated character class is a correct way to match any characters but a certain one/range(s) of characters. [^"] will never match a ", so the last " will never get matched with this pattern.
why the second quote character is captured but not the first one given that both of them are optional
The first "? comes before the greedy dot matching pattern. The engine sees the " (if it is in the string) and matches the quote with the first "?.

.+ is greedy. It'll collect everything including the ". Your final "? doesn't require that a quote be present, hence .+ includes the quote.
The first quote isn't captured because it's matched by the "?

Regular expression quantifiers are greedy by default: they try to match as much as possible, as early as possible.
Since your capturing group contains .+, it consumes the closing quote before the "? is ever tried. When the engine leaves the group it is already at the end of your line, where the optional "? matches the empty string.

.+ matches any character as long as it can (including the "). And when it reaches end of the input the "? is matching as it means the " is optional.
You should use "non greedy":
"(.+?)"

Related

Python last character zero or one time

Input string could be any of the below strings.
image: xyz.com/elk_init_cos7:1.0.0-20.12.0
image: "xyz.com/ckaf/kafka:4.1.0-5.4.1-59"
import re
# str holds one of the input strings shown above
mat = re.search(r'image:\s*"?(.+?/(.+?):(.+?))"?', str)
if mat:
    print(mat.group(1))
    print(mat.group(2))
    print(mat.group(3))
Output:
xyz.com/ckaf/kafka:4
ckaf/kafka
4
If I use the regex "image:\s*"?(.+?/(.+?):(.+))"?" instead, the last group comes back with the trailing double quote: 4.1.0-5.4.1-59".
How can I get the last part of the string without the " at the end and still match the other (unquoted) input string as well?
The (.+?))\"? part, when it sits at the end of the pattern, matches very few chars because .+? only has to match a single char. The engine then checks for a ", and since \"? is optional, the match succeeds right there with only that single char captured by (.+?). Nothing obliges the engine to keep matching until a " char.
The (.+))\"? at the end of the pattern will match and capture text up to the end of the line, and \"? will match nothing (or, in other words, empty string).
You want to match anything but a " char, one or more times here.
image:\s*\"?(.+?/(.+?):([^\"]+))
See the regex demo. I added \n in the online demo just to make sure the match does not go across lines; if the lines are standalone strings in your real scenario, you do not need it.
You may use the negated character classes in other places of your regex, too:
image:\s*\"?([^/]+/([^:]+):([^\"]+))
See this regex demo.
The [^/]+/([^:]+) part now matches one or more chars other than / (with [^/]+), then matches a / char, and then captures into Group 2 one or more chars other than a : char (with ([^:]+)). Group 1 still spans the whole image reference.
re.search("image:\s*\"?(.+?/(.+?):(.+?))\"?$", str)
When I placed $ at the end, it addressed my issue. Thank you for sharing your input.
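For reference, a minimal sketch that runs the negated-character-class pattern from the answer against both input strings from the question. The variable names (lines, parsed) are only illustrative, and each input is assumed to be a standalone single-line string.

```python
import re

# The two sample inputs from the question, one quoted and one not.
lines = [
    'image: xyz.com/elk_init_cos7:1.0.0-20.12.0',
    'image: "xyz.com/ckaf/kafka:4.1.0-5.4.1-59"',
]

# Group 1: whole reference, group 2: path after the first "/", group 3: tag.
pattern = re.compile(r'image:\s*"?([^/]+/([^:]+):([^"]+))')

parsed = []
for line in lines:
    m = pattern.search(line)
    if m:
        parsed.append(m.groups())
        print(m.group(1), m.group(2), m.group(3), sep=" | ")
```

Because [^"]+ can never consume a quote, the tag in group 3 comes back without the trailing " in the quoted case and runs to the end of the string in the unquoted case.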

Regular Expression to match a mandatory symbol in an optional part of a string?

What is the regular expression that matches a mandatory symbol in an optional part of a string?
For example, abcd will be matched by the RE but, if I add :, the resulting string will not be matched unless I add letter(s) afterwards like this abcd:efg.
So, the optional part is the : onward, and the mandatory symbol in this optional part is the : itself.
abcd:efg:hijk also needs to be matched.
UPDATE:
I tried this ^([a-z]|_)*(:[a-z]|_)*$ but it did not work as expected.
You should include more examples and counter-examples, but this should be close enough to your goal:
^[a-z_]+(:[a-z_]+)*$
Here's a test.
The problem with your ^([a-z]|_)*(:[a-z]|_)*$ regex is that it only matches one letter after each :. a:b:c:d matches but not a:b:c:de.
Finally, please note that (:[a-z]|_) means either:
a colon followed by a letter,
or an underscore.
It doesn't match a colon followed by an underscore!
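A small sketch comparing the suggested pattern with the original attempt on a handful of strings. The sample list is made up for illustration; only the two regexes come from the thread.

```python
import re

suggested = re.compile(r'^[a-z_]+(:[a-z_]+)*$')
original = re.compile(r'^([a-z]|_)*(:[a-z]|_)*$')

# Illustrative test strings, including the ones mentioned in the question.
samples = ['abcd', 'abcd:', 'abcd:efg', 'abcd:efg:hijk', 'a:b:c:de', 'a:_']

# Map each sample to (matches suggested, matches original).
outcome = {s: (bool(suggested.match(s)), bool(original.match(s))) for s in samples}

for s, (new_ok, old_ok) in outcome.items():
    print(f"{s:15} suggested={new_ok!s:5} original={old_ok}")
```

The suggested pattern rejects only abcd: (a colon with nothing after it), while the original attempt also rejects every string with more than one character after a colon.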
I would prefer a regex with a positive lookbehind. This also makes it easier to group the matching parts. It first matches the first string, and then matches all the following strings when preceded with a ":"
([a-z_]*)((?<=:):[a-z_])?
https://regex101.com/r/NkiZ3g/1
Your problem is that you need to know how to express optionality for a stretch longer than a single character. Try this:
^abcd(:efg)?$
For abcd and efg substitute whatever you're really looking for.

Pattern for '.' separated words with arbitrary number of whitespaces

It's the first time that I'm using regular expressions in Python and I just can't get it to work.
Here is what I want to achieve: I want to find all strings, where there is a word followed by a dot followed by another word. After that an unknown number of whitespaces followed by either (off) or (on). For example:
word1.word2 (off)
Here is what I have come up with so far.
string_group = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
\w+ for the first word
\. for the dot
\w+ for the second word
\s+ for the whitespaces
[(\(on\))(\(off\))] for the (off) or (on)
I think that the last expression might not be doing what I need it to. With the implementation right now, the program does find the right place in the string, but the output of
string_group.group(0)
Is just
word1.word2 (
instead of the whole expression I'm looking for. Could you please give me a hint what I am doing wrong?
[ ... ] denotes a character class, which matches exactly one of the characters inside it unless you add a quantifier: [ ... ]+ for one or more times.
But simply adding that won't work...
\w+\.\w+\s+[(\(on\))(\(off\))]+
Will match garbage stuff like word1.word2 )(fno(nofn too, so you actually don't want to use a character class, because it'll match the characters in any order. What you can use is a capturing group, and a non-capturing group along with an OR operator |:
\w+\.\w+\s+(\((?:on|off)\))
(?:on|off) will match either on or off
Now, if you don't want the parentheses to be captured in the first group as well, you can change that to:
\w+\.\w+\s+\((on|off)\)
You've got your logical OR mixed up.
[(\(on\))(\(off\))]
should be
\((?:on|off)\)
[]s are just for matching single characters.
The square brackets are a character class, which matches any one of the characters in the brackets. You appear to be trying to use it to match one of the sub-regexes (\(on\)) and (\(off\)). The way to do that is with an alternation operation, the pipe symbol: (\(on\)|\(off\)).
I think your problem may be with the square brackets []
they indicate a set of single characters to match. So your expression would match a single instance of any of the following chars: "()ofn"
So for the string "word1.word2 (on)", you are matching only this part: "word1.word2 ("
Try using this one instead:
re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
This match assumes that the () will be there, and looks for either "on" or "off" inside the parentheses.
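To see the difference side by side, here is a minimal sketch with a made-up analyzed_string. The patterns are the ones from the question and the answers; the variable names are only illustrative.

```python
import re

# Hypothetical input containing two "word.word (on|off)" occurrences.
analyzed_string = "status: word1.word2    (off) and word3.word4 (on)"

char_class = r'\w+\.\w+\s+[(\(on\))(\(off\))]'   # original attempt
alternation = r'\w+\.\w+\s+\((on|off)\)'         # corrected pattern

# The character class consumes a single character, so the match stops at "(".
first_try = re.search(char_class, analyzed_string).group(0)

# The alternation matches the full "(off)" and captures the state.
fixed = re.search(alternation, analyzed_string)

# With a quantifier, the character class accepts its characters in any order.
garbage_ok = re.fullmatch(char_class + '+', 'word1.word2 )(fno(nofn') is not None

print(repr(first_try))
print(repr(fixed.group(0)), fixed.group(1))
print(re.findall(alternation, analyzed_string))
print(garbage_ok)
```

Note that re.findall returns only the captured group here, so it yields the on/off states rather than the whole matched text.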

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print(re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line))
But I got null.
If I understood the requirement correctly:
a1 = re.search(r'_abc(.*)xyz', line, re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any single character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
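If only the text between the two markers is wanted, one possible variation is sketched below. The helper name text_between, the shortened sample text, and the choice of a lazy .*? (so the match stops at the first xyz rather than the last) are assumptions for illustration, not part of the answer above.

```python
import re

def text_between(prefix, suffix, text):
    """Return the text strictly between prefix and suffix, or None."""
    # re.escape makes the markers literal; (.*?) lazily captures what lies between.
    pattern = r"{}(.*?){}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None

# Shortened version of the file contents from the question.
sample = "sjaskdjajldlj_abc:\ncdf_asjdl_dlsf1:\ndfsflks %jdkeajd\nxyz\n"

inner = text_between("_abc:", "xyz", sample)
print(repr(inner))

# Without re.DOTALL the dot cannot cross line breaks, so nothing is found.
no_dotall = re.search(r"_abc:(.*?)xyz", sample)
print(no_dotall)
```

The captured text still includes the newline right after _abc: and the one right before xyz; call .strip() on the result if those are unwanted.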

Python regex with *?

What does this Python regex match?
.*?[^\\]\n
I'm confused about why the . is followed by both * and ?.
* means "match the previous element as many times as possible (zero or more times)".
*? means "match the previous element as few times as possible (zero or more times)".
The other answers already address this, but what they don't bring up is how it changes the regex. If the re.DOTALL flag is provided it makes a huge difference, because . will then match line break characters. So .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
If the re.DOTALL flag is not provided, the difference is more subtle: [^\\] will match everything other than a backslash, including line break characters. Consider the following example:
>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']
So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.
This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. However, the .* will match foo, the [^\\] will match the \n that ends the first line, and the next \n will match because the second line is blank.
. indicates a wild card. It can match anything except a \n, unless the appropriate flag is used.
* indicates that you can have 0 or more of the thing preceding it.
? after a quantifier makes that quantifier lazy: it matches as few repetitions as possible, taking more only when the rest of the pattern cannot match otherwise.
Opening the Python re module documentation, and searching for *?, we find:
*?, +?, ??:
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
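The documentation's example can be reproduced with a few lines. This is just a sketch of that one case, with illustrative variable names.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the end and backtracks to the last ">".
greedy = re.match(r"<.*>", html).group()

# Lazy: .*? stops at the first ">" that lets the pattern succeed.
lazy = re.match(r"<.*?>", html).group()

# With findall, the lazy form picks up each tag separately.
tags = re.findall(r"<.*?>", html)

print(greedy)
print(lazy)
print(tags)
```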
