Python regex: alternative positive lookbehind assertion

I have the following regex expression which is meant to find the "IF" keyword (case insensitive) in a string. Some constraints are imposed:
It should be preceded by a whitespace or a ) character (from a previous expression)
It should be followed by whitespace or ( character
The below expression accomplishes these constraints. However, this expression does not find the keyword when it's located at the start of a string (if(foo, 1, 2) for instance).
Using something like ^|(?<=[\s\)])(?i)if(?=[\s\(]) does not seem to work. I tried (?:^|[\s\)]) but that seems to also capture the space in front of the keyword.
This is what I have so far:
(?<=[\s\)])(?i)if(?=[\s\(])

You may use an alternation group with two zero-width assertions:
(?i)(?:^|(?<=[\s)]))if(?=[\s(])
    ^^^^^^^^^^^^^^^^
See the regex demo.
Here, (?:^|(?<=[\s)])) matches:
^ - start of string
| - or
(?<=[\s)]) - a location that is immediately preceded with a whitespace or ) character.
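Note that in Python versions before 3.11 the (?i) inline case-insensitive modifier affects the whole pattern regardless of where it is located in it, and since Python 3.11 a global inline flag that is not at the start of the pattern raises an error. So move it to the pattern start, which also gives better visibility.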
Also, there is no need to escape ( and ) inside character classes, [...] constructs, as they are treated as literal parentheses inside them.
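A minimal Python sketch of the pattern above, run against a few inputs. The sample strings and the helper name find_if are illustrative assumptions, not part of the question.

```python
import re

# (?i) sits at the very start; ^ and the lookbehind share one non-capturing group.
IF_KEYWORD = re.compile(r"(?i)(?:^|(?<=[\s)]))if(?=[\s(])")

def find_if(text):
    """Return the start offsets of every standalone IF keyword in text."""
    return [m.start() for m in IF_KEYWORD.finditer(text)]

print(find_if("if(foo, 1, 2)"))   # keyword at the start of the string
print(find_if("x = 1) IF (y)"))   # keyword after whitespace, upper case
print(find_if("(a)if b"))         # keyword right after ")"
print(find_if("elif(foo)"))       # "if" inside another word is not matched
```

Because both alternatives in the group are zero-width, the match itself contains only the keyword, never the preceding space or parenthesis.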

The problem is that | is applied at the top level, so it is an alternation between:
^ and (?<=[\s\)])(?i)if(?=[\s\(]).
Just add a non-capturing group around ^ and (?<=[\s\)]):
(?:^|(?<=[\s\)]))(?i)if(?=[\s\(])
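The difference between the two forms can be seen in a short Python sketch. The sample string is an assumption, and (?i) is written at the start of both patterns here because Python 3.11 and later reject a global inline flag in the middle of a pattern.

```python
import re

text = "x if y"

# Top-level alternation: the regex is "^" OR "(?<=...)if(?=...)".
ungrouped = re.compile(r"(?i)^|(?<=[\s)])if(?=[\s(])")
# Grouped: "^" and the lookbehind are alternatives for the prefix only.
grouped = re.compile(r"(?i)(?:^|(?<=[\s)]))if(?=[\s(])")

first_ungrouped = ungrouped.search(text)
first_grouped = grouped.search(text)

print(first_ungrouped.span(), repr(first_ungrouped.group()))  # empty match at the string start
print(first_grouped.span(), repr(first_grouped.group()))      # the keyword itself
```

With the ungrouped form, the bare ^ alternative succeeds on its own at position 0, so the first match is an empty string rather than the keyword.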

You can solve the problem (for this particular case that only involves single characters) using a double negation:
(?<![^\s)])
(not preceded by a character that is neither a whitespace nor a closing parenthesis). This condition also holds at the start of the string.
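As a rough check, the double-negation lookbehind can be compared with the alternation form in Python. The sample strings are assumptions chosen for illustration.

```python
import re

# Negative lookbehind with a negated class: it fails only when a previous
# character exists and is neither whitespace nor ")".
DOUBLE_NEG = re.compile(r"(?i)(?<![^\s)])if(?=[\s(])")
ALTERNATION = re.compile(r"(?i)(?:^|(?<=[\s)]))if(?=[\s(])")

samples = ["if(foo, 1, 2)", "x = 1) IF (y)", "(a)if b", "elif(foo)"]

comparison = {}
for s in samples:
    a = [m.span() for m in DOUBLE_NEG.finditer(s)]
    b = [m.span() for m in ALTERNATION.finditer(s)]
    comparison[s] = (a, b)
    print(s, a, a == b)
```

At the start of the string there is no previous character at all, so the negated class cannot match and the negative lookbehind succeeds.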

Related

regex to get a substring where the main string's ending is also the substring's ending [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
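A small Python sketch of this behaviour, using Python's re module as one example flavor (the test strings are assumptions):

```python
import re

# Inside [...] the ampersand, pipe and dollar sign are ordinary characters.
literal_hits = re.findall(r"[&|$]", "a&b|c$d")
print(literal_hits)

# In Python, [\b] is a backspace character; \b outside a class is a word boundary.
backspace_hit = re.search(r"[\b]", "a\bb")   # the subject contains a real backspace
no_boundary_hit = re.search(r"[\b]", "a b")  # word boundaries exist, but no backspace
print(backspace_hit is not None, no_boundary_hit is not None)
```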
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
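In Python the distinction matters when the input ends with a newline, because $ also matches just before a trailing newline while \Z matches only at the very end. A minimal sketch, with an assumed input string:

```python
import re

text = "index.php?test=1&list=UL\n"  # note the trailing newline

with_dollar = re.search(r"[&?]list=(.*?)(?=&|$)", text)
with_abs_end = re.search(r"[&?]list=(.*?)(?=&|\Z)", text)

print(repr(with_dollar.group(1)))  # $ is satisfied right before the final "\n"
print(with_abs_end)                # \Z is never reached, since . does not match "\n"
```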
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use a negated character class (or bracket expression in POSIX terms):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although that is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to the matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
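The three single-character-delimiter patterns can be tried on the two URLs from the question with a short Python sketch (the dictionary keys are just labels chosen here):

```python
import re

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

patterns = {
    "negated class": r"[&?]list=([^&]*)",
    "positive lookahead": r"[&?]list=(.*?)(?=&|$)",
    "negative lookahead": r"[&?]list=(.*?)(?![^&])",
}

# For each pattern, collect capturing group 1 for both URLs.
results = {
    name: [re.search(pattern, url).group(1) for url in urls]
    for name, pattern in patterns.items()
}
for name, values in results.items():
    print(name, values)
```

All three capture the same value whether list= sits in the middle of the query string or at its end.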

Python 2.7 regular expression match issue

Suppose I am using the following regular expression. Logically, it means: match anything that starts with the prefix foo: and continues with anything that is not a space. The match group is the part after the prefix foo:.
My question is: what exactly does "anything" mean in Python 2.7? Any ASCII character, or something else? If anyone could share some documentation, that would be great. Thanks.
a = re.compile('foo:([^ ]+)')
thanks in advance,
Lin
Try:
a = re.compile(r'foo:\S*')
\S means anything but whitespace.
I recommend you check out http://pythex.org.
It's really good for testing out regular expressions and has a decent cheat-sheet.
UPDATE:
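Anything (.) matches any character except a newline, and it is not limited to ASCII. With the re.DOTALL flag it matches a newline too.
The regular expression metacharacter which matches any character (except a newline, by default) is . (dot).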
a = re.compile('foo:(.+)')
The character class [^ ] matches any one character which isn't one of the characters between the square brackets (a literal space, in this example). The quantifier + specifies one or more repetitions of the preceding expression.
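The three variants discussed in this thread stop at different places. The sketch below is written for Python 3 rather than 2.7, and the sample text is an assumption chosen to show the difference.

```python
import re

text = "foo:bar\tbaz qux\nnext line"

not_space = re.search(r"foo:([^ ]+)", text).group(1)  # stops only at a literal space
non_white = re.search(r"foo:(\S+)", text).group(1)    # stops at any whitespace
dot_plus = re.search(r"foo:(.+)", text).group(1)      # stops at a newline

print(repr(not_space))
print(repr(non_white))
print(repr(dot_plus))
```

Note that [^ ] accepts the tab, because only the space character is excluded, while \S rejects it.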

Regex, better way

How do you separate matches of a regex that could match multiple times within a string when the delimiter also occurs within the string, i.e.:
Well then 'Bang bang swing'(BBS) aota 'Bing Bong Bin'(BBB)
With the regex: "'.+'(\S+)"
It would match everything from 'Bang ... to (BBB) instead of matching 'Bang bang swing'(BBS) and 'Bing Bong Bin'(BBB) separately.
I have a manner of making this work with regex: '[A-z0-9-/?|q~`!##$%^&*()_-=+ ]+'(\S+)
But this is excessive, and honestly I hate that it even works correctly.
I'm fairly new to regexes, and beginning with Python's implementation of them is apparently not the smartest way to start.
To get a substring from one character up to another character, where neither can appear in-between, you should always consider using negated character classes.
The [negated] character class matches any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters. If you don't want a negated character class to match line breaks, you need to include the line break characters in the class. [^0-9\r\n] matches any character that is not a digit or a line break.
So, you can use
'[^']*'\([^()]*\)
See regex demo
Here,
'[^']*' - matches ' followed by 0 or more characters other than ' and then followed by a ' again
\( - matches a literal ( (it must be escaped)
[^()]* - matches 0 or more characters other than ( and ) (they do not have to be escaped inside a character class)
\) - matches a literal ) (must be escaped outside a character class).
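A quick Python check on the string from the question. The greedy pattern below is the question's attempt with the parentheses escaped, which is an assumption about what was intended.

```python
import re

text = "Well then 'Bang bang swing'(BBS) aota 'Bing Bong Bin'(BBB)"

# Greedy dot: runs to the last quote that is followed by "(...)".
greedy = re.findall(r"'.+'\(\S+\)", text)
# Negated classes: each quoted part and each parenthesized part stays separate.
negated = re.findall(r"'[^']*'\([^()]*\)", text)

print(greedy)
print(negated)
```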
If you might have 1 or more single quotes before (...) part, you will need an unrolled lazy matching regex:
'[^']*(?:'(?!\([^()]*\))[^']*)*'\([^()]*\)
See regex demo.
Here, the '[^']*(?:'(?!\([^()]*\))[^']*)*' is matching the same as '.*?' with DOTALL flag, but is much more efficient due to the linear regex execution. See more about unrolling regex technique here.
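To see where the unrolled pattern differs from the simpler one, here is a small Python sketch with an apostrophe inside the quoted part. The sample string is hypothetical.

```python
import re

text = "x 'Bob's car'(BC) and 'Al'(A)"

simple = re.findall(r"'[^']*'\([^()]*\)", text)
unrolled = re.findall(r"'[^']*(?:'(?!\([^()]*\))[^']*)*'\([^()]*\)", text)

print(simple)    # the first match starts at the inner apostrophe
print(unrolled)  # the inner apostrophe is consumed because no "(...)" follows it
```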
EDIT:
When input strings are not complex and short, lazy dot matching turns out more efficient. However, when complexity grows, lazy dot matching may cause issues.
How about this regular expression:
'.+?'\(\S+\)

Python Regular Expression (\..+)?

I am confused about the semantics of the following Python regular expression:
r"/actors(\\..+)?"
I looked through the Python documentation section on regular expressions, but couldn't make sense of this expression. Can someone help me out?
/ # literal /
actors # literal actors
( # starting a subpattern
\\ # (escaped) literal \
. # arbitrary character
.+ # 1 or more arbitrary characters
)? # ends the subpattern and makes it optional
This would mean, it matches forward slash, 'actors', and then optionally backslash and 2 or more arbitrary characters.
I suppose there is a typo here. Either the string should not have been marked raw, or there is one backslash too many. In both cases there would be an escaped . instead of an escaped \ followed by an arbitrary character. This in turn would match files called actors with an arbitrary or missing file extension.
So either "/actors(\\..+)?" or r"/actors(\..+)?".
\\..+
Here, \\ is an escaped \ character, so it matches exactly that one character. Next comes a . that can match any character, followed by another . that must occur at least once (or more often). So ..+ will match two characters or more, and \\..+ will match any two characters or more, prefixed by a backslash.
(\\..+)?
Since all of this is inside an optional capturing group, it can be left out as well.
Note that the expression is probably wrong. It looks as if you are trying to match some kind of URL and want to match the file extension, introduced by a . character. However, the \\ inside a raw string r"..." matches a literal \ character and does not escape the dot. So you probably want r"/actors(\..+)?" or "/actors(\\..+)?".
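It means: the string /actors, followed by an optional capture group that contains a literal . and then one or more of whatever the non-literal . is configured to match. (This reading assumes the pattern r"/actors(\..+)?"; as written, with the raw prefix and \\, the group starts with a literal backslash instead.)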
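The two readings can be compared directly in Python. The test paths below are assumptions, and re.fullmatch is used so that the optional group cannot be silently skipped on a partial match.

```python
import re

as_written = r"/actors(\\..+)?"  # regex: literal backslash, any char, 1+ any chars
intended = r"/actors(\..+)?"     # regex: literal dot, 1+ any chars

written_on_ext = re.fullmatch(as_written, "/actors.json")
written_on_backslash = re.fullmatch(as_written, "/actors\\ab")
intended_on_ext = re.fullmatch(intended, "/actors.json")
intended_on_bare = re.fullmatch(intended, "/actors")

print(written_on_ext)                          # no match: ".json" has no backslash
print(repr(written_on_backslash.group(1)))     # backslash plus two characters
print(repr(intended_on_ext.group(1)))          # the file extension
print(intended_on_bare.group(1))               # optional group did not participate

# Without the r prefix, "\\" collapses to one backslash, giving the intended regex.
same_pattern = "/actors(\\..+)?" == intended
print(same_pattern)
```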
