functional difference between lookarounds and non-capture group? - python

I'm trying to come up with an example where a positive lookaround works but a non-capturing group won't, to better understand how they are used. The examples I'm coming up with all work with non-capturing groups as well, so I feel like I'm not fully grasping the usage of positive lookaround.
Here is a string (taken from an SO example) that uses a positive lookahead in the answer. The user wanted to grab the second column value, only if the value of the first column started with ABC and the last column had the value 'active'.
string ='''ABC1 1.1.1.1 20151118 active
ABC2 2.2.2.2 20151118 inactive
xxx x.x.x.x xxxxxxxx active'''
The solution given used a positive lookahead, but I noticed that I could use a non-capturing group to arrive at the same answer.
So I'm having trouble coming up with an example where a positive lookaround works and a non-capturing group doesn't.
import re
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?=\S+\s+active)')  # solution
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?:\S+\s+active)')  # solution w/out lookaround
If anyone would be kind enough to provide an example, I would be grateful.
Thanks.

The fundamental difference is that non-capturing groups still consume the part of the string they match, thus moving the cursor forward, whereas lookarounds only check the text and leave the cursor where it is.
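As a small illustration using the first line of the string from the question (the variable names here are made up for this sketch), both patterns capture the same group 1, but the overall match and the end position differ because only the non-capturing group consumes the trailing columns:

```python
import re

text = "ABC1 1.1.1.1 20151118 active"

lookahead = re.compile(r"ABC\w\s+(\S+)\s+(?=\S+\s+active)")
non_capturing = re.compile(r"ABC\w\s+(\S+)\s+(?:\S+\s+active)")

m1 = lookahead.search(text)
m2 = non_capturing.search(text)

# Group 1 is identical in both cases.
print(m1.group(1), m2.group(1))  # 1.1.1.1 1.1.1.1

# The lookahead stops before the date column; the group swallows the rest.
print(repr(m1.group(0)))  # 'ABC1 1.1.1.1 '
print(repr(m2.group(0)))  # 'ABC1 1.1.1.1 20151118 active'
print(m1.end(), m2.end())  # 13 28
```

That is why the two patterns look interchangeable when you only read group 1 from a single match per line.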
One example where this makes a fundamental difference is when you try to match strings that are surrounded by certain boundaries, and these boundaries can overlap. Sample task:
Match all a's in a given string that are surrounded by b's. The given string is bababaca. There should be two matches, at positions 2 and 4 (counting from 1).
Using lookarounds this is rather easy: b(a)(?=b) or (?<=b)a(?=b) both find them. But (?:b)a(?:b) won't work: the first match also consumes the b at position 3, which is needed as the left boundary for the second match. (Note: the non-capturing groups aren't actually needed here; bab behaves the same.)
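The overlapping-boundary task can be checked with a short sketch like the following (variable names are illustrative; positions are converted to 1-based to match the description above):

```python
import re

text = "bababaca"

# Lookarounds check the neighbouring b's without consuming them,
# so the b between two a's can serve as a boundary for both.
lookaround_positions = [
    m.start() + 1 for m in re.finditer(r"(?<=b)a(?=b)", text)
]

# Non-capturing groups consume the b's. After "bab" is matched, the
# search resumes at the next "a", which no longer has a "b" to its left.
consuming_matches = [m.group(0) for m in re.finditer(r"(?:b)a(?:b)", text)]

print(lookaround_positions)  # [2, 4]
print(consuming_matches)     # ['bab']
```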
Another rather prominent example is password validation: check that the password contains uppercase letters, lowercase letters, numbers, whatever. You can use a bunch of alternations to match these, but lookaheads make it way easier:
(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])
vs
(?:.*[a-z].*[A-Z].*[0-9].*[!?.])|(?:.*[A-Z].*[a-z].*[0-9].*[!?.])|(?:.*[0-9].*[a-z].*[A-Z].*[!?.])|(?:.*[!?.].*[a-z].*[A-Z].*[0-9])|(?:.*[A-Z].*[a-z].*[!?.].*[0-9])|...
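A minimal sketch of the lookahead version in Python, assuming a hypothetical helper named is_acceptable. Each lookahead starts from the same position, so the order in which the character classes appear in the password does not matter:

```python
import re

# Four independent checks, all evaluated from the start of the string.
PASSWORD_RE = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])")


def is_acceptable(password: str) -> bool:
    """Return True if every required character class occurs somewhere."""
    return PASSWORD_RE.match(password) is not None


print(is_acceptable("aB3!"))     # True
print(is_acceptable("3!Ba"))     # True (different order, same result)
print(is_acceptable("abc123!"))  # False (no uppercase letter)
```

Since lookaheads consume nothing, the pattern itself matches the empty string at position 0; only the success or failure of the match is of interest here.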

Related

Regex - Group everything until last occurence

When working on this string:
see.Ya23.v2.0023.jpg
I already found out I could get the last occurence of a number by using:
(?P<Frame>\d+(?!.*\d))
It gives me the group containing "0023".
But how do I group everything until that happens?
If I do this:
(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))
My two groups contain "see.Ya23.v2.002" and "3", when I would like them to contain "see.Ya23.v2." and "0023".
Hope you can help me. Thanks in advance.
You almost had it.
In the first group, add the lazy quantifier ? after .* so that the group stops at the first position where the rest of the pattern can match.
(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))
This will give you
see.Ya23.v2. and 0023
And if you also want to avoid selecting the dot:
(?P<Sequence>.*?)\.(?P<Frame>\d+(?!.*\d))
The result is see.Ya23.v2 and 0023
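For comparison, here is a quick sketch in Python that runs the original greedy pattern and the two lazy variants against the file name from the question (the variable names are only for illustration):

```python
import re

name = "see.Ya23.v2.0023.jpg"

greedy = re.search(r"(?P<Sequence>.*)(?P<Frame>\d+(?!.*\d))", name)
lazy = re.search(r"(?P<Sequence>.*?)(?P<Frame>\d+(?!.*\d))", name)
lazy_no_dot = re.search(r"(?P<Sequence>.*?)\.(?P<Frame>\d+(?!.*\d))", name)

# Greedy .* leaves only the last digit for Frame.
print(greedy.group("Sequence"), greedy.group("Frame"))  # see.Ya23.v2.002 3

# Lazy .*? grows one character at a time until Frame can match.
print(lazy.group("Sequence"), lazy.group("Frame"))  # see.Ya23.v2. 0023

# With an explicit \. between the groups, the dot is in neither group.
print(lazy_no_dot.group("Sequence"), lazy_no_dot.group("Frame"))  # see.Ya23.v2 0023
```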
The simplest and quickest way is to put a negative lookbehind for a digit, (?<!\d), before the digit expression at the start of the Frame group. This makes sure Frame is the last complete run of digits while still allowing a greedy Sequence match, which may perform better than the lazy version.
(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))
https://regex101.com/r/LCUoCR/1
The problem is explained in my Youtube video related to how backtracking works in regex.
In short: the .* part matches the whole string first, and then the regex engine starts stepping back through the string to accommodate a part for the subsequent patterns, i.e. for \d+(?!.*\d). Once the 3 is found in see.Ya23.v2.0023.jpg, this pattern matches, and the regex engine returns a match.
All you need is to make sure the char before the \d+ is a non-digit char and you need to use
(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)
See the regex demo.
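The two greedy variants above can be compared side by side with a sketch like this one (split_name and the dictionary keys are hypothetical names chosen for the example):

```python
import re

patterns = {
    "lookbehind": r"(?P<Sequence>.*)(?P<Frame>(?<!\d)\d+(?!.*\d))",
    "optional_prefix": r"(?P<Sequence>(?:.*\D)?)(?P<Frame>\d+)(?!.*\d)",
}


def split_name(pattern, name):
    """Return (Sequence, Frame) for name, or None if there is no match."""
    m = re.search(pattern, name)
    return (m.group("Sequence"), m.group("Frame")) if m else None


for label, pattern in patterns.items():
    print(label, split_name(pattern, "see.Ya23.v2.0023.jpg"))
# lookbehind ('see.Ya23.v2.', '0023')
# optional_prefix ('see.Ya23.v2.', '0023')
```

Both force the character before the frame digits to be a non-digit (or the start of the string), which is what stops the greedy .* from stealing the leading digits of the frame number.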

Regex - how to exclude 4 digit number from wider numeric pattern

Really tried to browse and search if this specific question has been posted previously, so I hope I'm not asking an obvious one here.
My problem: I have a regex expression, with few different possible criteria for pattern matching separated by pipes. I'm OK with all of them except one, where I basically want to:
Find any expression which would be number between 4 and 6 digits (regardless of position in the string)
Exclude from this pattern expressions which would relate to years in this century (so starting with 20 and followed by two digits)
So for example, I would like to match: 4149, 20259, 202046, but would like to exclude 2019 as it will refer to a year and not the code I'm searching for.
Currently, I tried applying this one (only the last part of the expression): |\d{4,6}?!20\d{2}) , but it's not working properly. I know that the expressions preceding the pipe are fine, and I noticed that \d{4,6} stops working once I add the "exclusion", so I assume I'm not using ?! properly. Could I ask for advice on this one?
Edit: Solved! Thank you very much for immediate answers (I was really positively surprised how fast there were few alternative solutions). Sorry I had to pick just one, all of others would be adjustable and viable for my needs, I just found this one most appealing and tailored for my needs.
While I'm not sure word boundaries are your best bet for delimiting the number (maybe \D is better?), you could try:
\b(?!20\d\d\b)\d{4,6}\b
See the Online Demo
\b - Word boundary.
(?!20\d\d\b) - Negative lookahead: No literal 20 followed by two digits and a word boundary.
\d{4,6} - Four to six digits.
\b - Word boundary.
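A short Python check of this pattern against the numbers from the question. The second pattern is an assumption added for this sketch, not part of the answer: it swaps the word boundaries for digit lookarounds, which matters when a code is glued to an underscore or a letter.

```python
import re

CODE_RE = re.compile(r"\b(?!20\d\d\b)\d{4,6}\b")

sample = "4149 20259 202046 2019 123 1234567"
codes = CODE_RE.findall(sample)
print(codes)  # ['4149', '20259', '202046']

# Hypothetical variant: delimit by "no digit on either side" instead of \b.
DIGIT_EDGE_RE = re.compile(r"(?<!\d)(?!20\d\d(?!\d))\d{4,6}(?!\d)")

glued = "ID_4149_x 2019"
word_boundary_hits = CODE_RE.findall(glued)
digit_edge_hits = DIGIT_EDGE_RE.findall(glued)
print(word_boundary_hits)  # [] because "_" counts as a word character
print(digit_edge_hits)     # ['4149']
```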
You could use the following regular expression.
r'\b(?:20\d{3,4}|2[1-9]\d{2,4}|[13-9]\d{3,5})\b'
Demo
This should work:
\d{5,6}|(?:[013-9]\d|2[1-9])\d{2}
Match all 5 or 6 digit sequences first, then all 4 digit sequences except the ones starting with 20

What am I doing wrong with this negative lookahead? Filtering out certain numbers in a regex

I have a big piece of code produced by a software. Each instruction has an identifier number and I have to modify only certain numbers:
grr.add(new GenericRuleResult(RULEX_RULES.get(String.valueOf(11)), new Result(0,Boolean.FALSE,"ROSSO")));
grr.add(new GenericRuleResult(RULEX_RULES.get(String.valueOf(12)), new Result(0,Boolean.FALSE,"£££")));
etc...
Now, I am using SublimeText3 to change rapidly all of the wrong lines with this regex:
Of\((11|14|19|20|21|27|28|31)\)\), new Result\(
This regex above allowed me to put "ROSSO" (red) in each line containing those numbers. Now I have to put "VERDE" (green) in the remaining lines. My idea was to add a ?! in the Regex to look for all of the lines NOT CONTAINING those numbers.
From the website Regex101 I get in the description of the regex:
Of matches the characters Of literally (case sensitive)
\( matches the character ( literally (case sensitive)
Negative Lookahead (?!11|14|19|20|21|27|28|31)
Assert that the Regex below does not match
1st Alternative 11
etc...
So why am I not finding the lines containing 12, 13, 14 etc?
Edit: the Actual Regex: Of\((?!11|14|19|20|21|27|28|31)\)\), new Result\(
Your problem is that you are assuming a negative lookahead changes the cursor position. It does not.
That is, a negative lookahead of the form (?!xy) merely verifies that the next two characters are not xy. It does not then swallow two characters from the text. As its name suggests, it merely looks ahead from where you are, without moving ahead!
Thus, if you wish to match further things beyond that assertion you must:
negatively assert it is not xy;
then consume the two characters for whatever they are;
then continue your match.
So try something like:
Of\((?!11|14|19|20|21|27|28|31)..\)\), new Result\(
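The question is about Sublime Text, but the same behaviour can be reproduced in Python as a rough sketch (the two lines are copied from the question; the variable names are made up):

```python
import re

lines = [
    'grr.add(new GenericRuleResult(RULEX_RULES.get(String.valueOf(11)), new Result(0,Boolean.FALSE,"ROSSO")));',
    'grr.add(new GenericRuleResult(RULEX_RULES.get(String.valueOf(12)), new Result(0,Boolean.FALSE,"£££")));',
]

# The lookahead passes on "12" but consumes nothing, so the following
# \) is compared against the digit "1" and the match fails.
broken = re.compile(r"Of\((?!11|14|19|20|21|27|28|31)\)\), new Result\(")

# Adding ".." consumes the two characters that the lookahead only inspected.
fixed = re.compile(r"Of\((?!11|14|19|20|21|27|28|31)..\)\), new Result\(")

broken_hits = [bool(broken.search(line)) for line in lines]
fixed_hits = [bool(fixed.search(line)) for line in lines]

print(broken_hits)  # [False, False]
print(fixed_hits)   # [False, True]
```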

Python re module groups match mechanism

Question Formation
background
As I am reading through the tutorial in the Python 2.7 re docs, it introduces the behavior of groups:
The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
question
I clearly understand how this works for a single group, but I can't understand the following example:
>>> m = re.match("([abc])+","abc")
>>> m.groups()
('c',)
I mean, doesn't + simply mean one or more? If so, shouldn't the regex ([abc])+ be equivalent to ([abc])([abc])+ (not formal BNF)? Thus, the result should be:
('a','b','c')
Please shed some light about the mechanism behind, thanks.
P.S
I want to learn about the regex language interpreter. How should I start: books, or a specific regex version? Thanks!
Well, I guess a picture is worth a 1000 words:
link to the demo
What's happening is that, as the visual representation of the automaton shows, your regexp applies a single one-character group one or more times until it reaches the end of the match. Only that last character ends up in the group.
If you want to get the output you describe, you need to do something like the following:
([abc])([abc])([abc])
which will match and group one character at each position.
As for documentation, I advise you to first read up on the theory of NFAs and regexps. The MIT material on the topic is pretty nice:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011/lecture-notes/
Basically, the groups that are referred to in regex terminology are the capture groups as defined in your regex.
So for example, in '([abc])+', there's only a single capture group, namely, ([abc]), whereas in something like '([abc])([xyz])+' there are 2 groups.
So in your example, calling .groups() will always return a tuple of length 1 because that is how many groups exist in your regex.
The reason it isn't returning the results you'd expect is that the repeat operator + is outside the group. Each repetition overwrites what the group captured, so only the last match (c) is retained. If, on the other hand, you had used '([abc]+)' (notice the + is inside the capture group), the results would have been:
('abc',)
One pair of grouping parentheses forms one group, even if it's inside a quantifier. If a group matches multiple times due to a quantifier, only the last match for that group is saved. The group doesn't become as many groups as it had matches.
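The behaviours described in these answers can be seen together in a short sketch. The re.findall call at the end is an addition not mentioned in the answers; it is one common way to get each character separately without writing one group per position:

```python
import re

# One pair of parentheses is one group; repetition keeps only the last capture.
m = re.match(r"([abc])+", "abc")
print(m.groups())   # ('c',)
print(m.group(0))   # abc  (the whole match still covers all three characters)

# Quantifier inside the group: the group captures the whole run.
inside = re.match(r"([abc]+)", "abc")
print(inside.groups())  # ('abc',)

# One group per position gives one entry per character.
three = re.match(r"([abc])([abc])([abc])", "abc")
print(three.groups())  # ('a', 'b', 'c')

# findall returns every non-overlapping match of a one-character pattern.
each = re.findall(r"[abc]", "abc")
print(each)  # ['a', 'b', 'c']
```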

Negating match if a string is just before another string

I'm struggling to get a regex to work where it matches a certain pattern, so long as it isn't preceded by another. For example,
Accessory for MyProduct01 <<< Should be classified as an accessory
MyProduct01 with accessory << Should be classified as a product
So I need to add something to my 'accessory' regex, something like 'match "accessory" so long as the word before isn't "with"'.
I have seen some examples where people are using negative lookaheads to find if a word is anywhere in the string, but I want to be a bit more specific regarding the position of the word to negate. Something like:
(?!with\s)accessory
Just use a negative look-behind in your regex:
(?<!with\s)accessory
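A minimal sketch of this lookbehind in Python, using the two titles from the question. The lookbehind is fixed-width (five characters), which Python's re module accepts. The classify helper and the re.IGNORECASE flag are assumptions added here so that the capitalised "Accessory" in the first title is also found:

```python
import re

ACCESSORY_RE = re.compile(r"(?<!with\s)accessory", re.IGNORECASE)


def classify(title: str) -> str:
    """Label a title 'accessory' unless the word directly follows 'with '."""
    return "accessory" if ACCESSORY_RE.search(title) else "product"


print(classify("Accessory for MyProduct01"))   # accessory
print(classify("MyProduct01 with accessory"))  # product
```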
Python's re module doesn't support variable-width lookbehinds, so if "with" may appear anywhere before "accessory" rather than directly in front of it, you'll need lookaheads instead, and the original pattern has to change a bit.
^(?!.*\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b)
Here, the negative lookahead is used to ensure that "accessory" doesn't come after the word "with" anywhere in the string. Then, the positive lookahead is used to ensure that the word "accessory" occurs within the string, captured with a group if you need to capture it for some reason.
Because the pattern above consists only of zero-width assertions anchored at ^, it works with either search or match, and the overall match is empty. If you need the pattern to consume the whole string, for example to use it with fullmatch, you'd need to add a bit more to the pattern:
^(?!.*\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b).*$
