regex negative lookaround - python

I'm trying to do keyword matching using the following regular expression
you.{0,50}(?<!not)\s?special
on the following list of strings
to include:
youaresospecial
you are so special
you are pretty special
you are special
youarespecial
you are sospecial
you are very special
you are super special
you are special
you special
you aresospecial
to exclude:
youarenotspecial
you are not special
youarenotspecial
It matches all of the strings that I need to include; however, it also matches one of the strings that I would like to exclude ('you are not special').
I have been testing this on https://regex101.com/r/KTsjn8/1
Can someone point out why?

To explain why your regex fails:
Take you are not special.
you.{0,50} matches you are not  (note the space)
(?<!not) succeeds because the three characters before this position are "ot " (ending in the space), which is not "not"
\s? matches because it's optional
special matches.
To fix this, you can use a negative lookahead instead:
you(?!.*not\s?special).{0,50}special
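The walkthrough above can be checked with a short script. This is only a sketch: the variable names and the choice of sample strings are mine, while the two patterns are copied from the question and from this answer.

```python
import re

# Patterns copied from the question and from the answer above.
original = re.compile(r"you.{0,50}(?<!not)\s?special")
lookahead_fix = re.compile(r"you(?!.*not\s?special).{0,50}special")

samples = ["you are so special", "youarenotspecial", "you are not special"]

for text in samples:
    print(f"{text!r}: original={bool(original.search(text))}, "
          f"lookahead={bool(lookahead_fix.search(text))}")

# Capture what .{0,50} and \s? consume in the problem case:
# the dot pattern swallows the space after "not", and \s? matches nothing.
m = re.search(r"you(.{0,50})(?<!not)(\s?)special", "you are not special")
print(repr(m.group(1)), repr(m.group(2)))
```

Note that the lookahead uses .*, so it is not limited to the 50-character window: a "not special" anywhere later on the same line also blocks the match.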

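Your regex does not work because \s? is optional: the engine can let .{0,50} consume the space, match \s? against nothing, and then check (?<!not) at the position right before special. At that position the preceding text is "not " (with the space), so the assertion that there is no not behind it succeeds.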
You would have to make two negative lookbehind assertions instead, one with a space, and one without:
you.{0,50}(?<!not\s)(?<!not)special
Demo: https://regex101.com/r/KTsjn8/2
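A quick way to confirm this in Python, as a sketch: the lists below are the distinct strings from the question, and the variable names are my own.

```python
import re

# Two fixed-width negative lookbehinds: one for "not" + whitespace, one for "not".
pattern = re.compile(r"you.{0,50}(?<!not\s)(?<!not)special")

include = [
    "youaresospecial", "you are so special", "you are pretty special",
    "you are special", "youarespecial", "you are sospecial",
    "you are very special", "you are super special",
    "you special", "you aresospecial",
]
exclude = ["youarenotspecial", "you are not special"]

matched = [s for s in include if pattern.search(s)]
rejected = [s for s in exclude if not pattern.search(s)]

print(len(matched), "of", len(include), "include strings matched")
print(len(rejected), "of", len(exclude), "exclude strings rejected")
```

Python requires each lookbehind to have a fixed width, which is why the two cases are written as separate assertions rather than as (?<!not\s?).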

Related

regex to get a substring where the main string's ending is also the substring's ending [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
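To see this in Python (the question does not name a flavor, so treat the choice of Python's re module as an assumption), a small sketch comparing the pattern from the question with the one from the answer:

```python
import re

# Inside [...] the characters &, | and $ are all plain literals.
literals = re.findall(r"[&|$]", "a&b|c$d")
print(literals)

# [\b] is a backspace character in Python, not a word boundary.
backspace = re.fullmatch(r"[\b]", "\x08")
print(backspace is not None)

broken = re.compile(r"[&|\?]list=.*?([&|$])")  # pattern from the question
fixed = re.compile(r"(&|\?)list=.*?(&|$)")     # pattern from the answer

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]
for url in urls:
    b = broken.search(url)
    f = fixed.search(url)
    print(url, "| broken:", b.group(0) if b else None, "| fixed:", f.group(0))
```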
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android (Go also spells the anchor \z, but its RE2-based regexp package does not support lookarounds)
[&?]list=(.*?)(?=&|\Z) - OK for Python
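In Python the difference between $ and \Z shows up when the input ends with a newline. A minimal sketch, with a made-up input string:

```python
import re

text = "index.php?list=UL\n"  # hypothetical input with a trailing newline

# $ also matches just before a newline at the very end of the string.
with_dollar = re.search(r"[&?]list=(.*?)(?=&|$)", text)

# \Z matches only at the absolute end; the dot cannot cross the newline
# to get there, so this search finds nothing.
with_abs_end = re.search(r"[&?]list=(.*?)(?=&|\Z)", text)

print(with_dollar.group(1))
print(with_abs_end)
```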
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
Details:
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
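Applied to the two sample URLs from the question, the negated character class could be used like this (a sketch in Python; the variable names are illustrative):

```python
import re

pattern = re.compile(r"[&?]list=([^&]*)")

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

# Group 1 holds everything after "list=" up to the next & or the end of the string.
values = [pattern.search(url).group(1) for url in urls]
print(values)
```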
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when consecutive matches that may start and end with the same char are expected (see the original pattern: it may match a string starting and ending with &). Although this is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to the matching procedure. The OP's solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
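Both lookahead variants, and the multichar limitation, can be sketched in Python. The "&&" delimiter and the key= string below are invented purely for illustration.

```python
import re

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

positive = re.compile(r"[&?]list=(.*?)(?=&|$)")
negative = re.compile(r"[&?]list=(.*?)(?![^&])")

pos_values = [positive.search(u).group(1) for u in urls]
neg_values = [negative.search(u).group(1) for u in urls]
print(pos_values, neg_values)

# Hypothetical two-character delimiter "&&": a single & is part of the value.
text = "key=a&b&&next=1"
multi = re.search(r"key=(.*?)(?=&&|$)", text).group(1)

# A negated class works per character, so it stops at the first & it sees.
per_char = re.search(r"key=([^&]*)", text).group(1)
print(multi, per_char)
```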

Python reference to regex in parentheses

I have a text file that needs to have the letter 't' removed if it is not immediately preceded by a number.
I am trying to do this using re.sub and I have this:
import re
with open('File.txt') as fh:
    g = fh.read()
g = re.sub('([^0-9])t', '', g)
This identifies the letters to be removed correctly but also removes the preceding character. How can I refer to the parenthesized regex in the replacement string?
Thanks!
Use a lookbehind (or negative lookbehind) instead.
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
Three options:
g=re.sub('([^0-9])t','\\1',g)
or
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
The first option is what you asked for: a backreference to the captured text. '\\1' (or the raw string r'\1') refers to the first captured group.
Lookarounds don't consume characters, so you don't need to put them back in the replacement. Here, the second option uses a positive lookbehind and the third a negative lookbehind. Neither consumes what is inside its brackets, so the [^0-9] or [0-9] is not part of the text being replaced. It might be better to use those, since they also handle consecutive t's: the capturing version consumes the preceding character, so in 'tt' the second t can no longer use the first one as its preceding character.
The positive lookbehind makes sure that t has a non-digit character before it, so a t at the very start of the text is left alone. The negative lookbehind makes sure that t does not have a digit before it, so a t at the very start of the text is removed as well.
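The differences between the three options are easiest to see side by side. A sketch with made-up sample strings and a hypothetical helper named strip_t:

```python
import re

def strip_t(text):
    """Apply the three replacements from the answer to the same text."""
    backref = re.sub(r"([^0-9])t", r"\1", text)      # capture and put back
    positive = re.sub(r"(?<=[^0-9])t", "", text)     # needs a non-digit before t
    negative = re.sub(r"(?<![0-9])t", "", text)      # needs "no digit" before t
    return backref, positive, negative

# Illustrative inputs: ordinary case, t at the start, two t's in a row.
for sample in ["cat 2t bet", "tea 2t", "butt"]:
    print(repr(sample), strip_t(sample))
```

In the ordinary case all three agree; they differ only for a t at the start of the text and for consecutive t's.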

Joining two regular expressions together

I have two regular expressions, one matching all characters [a-z] and the other excluding the following combination of characters [^spuz(ih)] (the characters s, p, u, z, and the sequence ih). How would I combine these two so that I could allow all alphanumeric characters except those listed in the second RE?
re.match(r'^[a-z]*(?![spuz]|ih)[a-z]s$', insert_phrase)
You can't "combine" them as such, but you can write another regular expression which has the same effect. For this, you can use the (?!...) construct, a negative lookahead. It matches at a position, consuming 0 characters, only if the regular expression inside it does not match the text that follows. So you can use:
'(?![spuz(ih)])[a-z]'
Or, since this wasn't what you wanted, change it to:
'(?![spuz]|ih)[a-z]'
In the changed question, you seem to want negative lookbehind instead. This turns the pattern into:
'^[a-z]*(?<![a-z][spuz]|ih)s$'
Note the extra [a-z] in the lookbehind part. It is required because lookbehind expressions must be fixed width, so both alternatives have to be two characters long. This means that a string like 'ps' will match the pattern, because there are not two characters before the s for the lookbehind to look at, but you don't want that. So instead, it's better to use two separate lookbehinds (both of which have to be true for the string to match):
'^[a-z]*(?<![spuz])(?<!ih)s$'
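A short check of the two lookbehind versions in Python; the test words are invented for illustration.

```python
import re

# Single lookbehind, padded with [a-z] so both alternatives are two chars wide.
padded = re.compile(r"^[a-z]*(?<![a-z][spuz]|ih)s$")
# Two separate lookbehinds, each with its own fixed width.
separate = re.compile(r"^[a-z]*(?<![spuz])(?<!ih)s$")

words = ["cats", "his", "ps", "bus", "fihs"]

padded_hits = [w for w in words if padded.match(w)]
separate_hits = [w for w in words if separate.match(w)]

print("padded:  ", padded_hits)
print("separate:", separate_hits)
```

The only disagreement is 'ps', which the padded version lets through because fewer than two characters precede the final s.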

How to match alphabetical chars without numeric chars with Python regexp?

Using Python module re, how to get the equivalent of the "\w" (which matches alphanumeric chars) WITHOUT matching the numeric characters (those which can be matched by "[0-9]")?
Notice that the basic need is to match any character (including all unicode variation) without numerical chars (which are matched by "[0-9]").
As a final note, I really need a regexp as it is part of a greater regexp.
Underscores should not be matched.
EDIT:
I hadn't thought about underscores state, so thanks for warnings about this being matched by "\w" and for the elected solution that addresses this issue.
You want [^\W\d]: the set of characters that are neither non-word characters nor digits, which leaves letters and the underscore. Add an underscore to that negated set, giving [^\W\d_], if you don't want underscores either.
A bit twisted, if you ask me, but it works. Should be faster than the lookahead alternative.
(?!\d)\w
A position that is not followed by a digit, and then \w. Effectively cancels out digits but allows the \w range by using a negative look-ahead.
The same could be expressed as a positive look-ahead and \D:
(?=\D)\w
To match several of these in a row, wrap the expression in a non-capturing group and repeat it:
(?:(?!\d)\w)+
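The variants from both answers can be compared on a sample string. This is a sketch with an invented input; note that in Python 3 str patterns \d matches all Unicode decimal digits, not only [0-9].

```python
import re

text = "héllo_wörld 42 abc123"  # illustrative input

# Negated class: word characters that are not digits (underscore still allowed).
no_digits = re.findall(r"[^\W\d]+", text)

# Same, with the underscore excluded as the question requires.
letters_only = re.findall(r"[^\W\d_]+", text)

# Lookahead alternative: a \w that is not a digit, repeated.
lookahead = re.findall(r"(?:(?!\d)\w)+", text)

print(no_digits)
print(letters_only)
print(lookahead)
```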
