Finding all words: Negative Look Behind in Regex

I am currently using Python 2.7 (I'm working with some old code of mine). And I am trying to get all words via regex, where I can ignore words with apostrophes, like can't and Gary's. So far I have made all letters in the string lowercase and here's my current regex:
r"(?<=\s|^)([a-z]+)(?=\s|$)"
I get the following error:
raise error, v # invalid expression
error: look-behind requires fixed-width pattern
I also tried:
r"(?:\s|^)([a-z]+)(?=\s|$)"
But, as you can see on Regex101, it doesn't capture the last word.
I know that there are probably better alternatives to doing this, but now I am really curious as to how to do a negative look behind in this situation. However, if you could explain that and offer your own better solution, that'd be fine and appreciated.

In this case, just use a negative lookbehind with the opposite character class \S (same can be done with the lookahead):
r"(?<!\S)([a-z]+)(?!\S)"
See the regex demo.
A "positive" approach will look less pretty:
r"(?:(?<=\s)|^)([a-z]+)(?=\s|$)"
See another regex demo. The (?:(?<=\s)|^) non-capturing group combines 2 zero-width assertion alternatives, (?<=\s) that requires a whitespace before the current location, and ^, matching the start of string.
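As a quick check, here is a minimal sketch of both patterns in Python 3. The sample sentence is made up for illustration; the patterns are the ones given above.

```python
import re

# Hypothetical sample text, already lowercased as in the question.
text = "i can't find gary's old code today"

# Negative lookarounds: no non-whitespace character on either side.
negative = re.compile(r"(?<!\S)([a-z]+)(?!\S)")
# "Positive" variant: whitespace or start before, whitespace or end after.
positive = re.compile(r"(?:(?<=\s)|^)([a-z]+)(?=\s|$)")

words_negative = negative.findall(text)
words_positive = positive.findall(text)
print(words_negative)  # ['i', 'find', 'old', 'code', 'today']
print(words_positive)  # same list
```

Both can't and gary's are skipped entirely, because the apostrophe counts as a non-whitespace neighbour for the letters on either side of it.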

Related

REGEX: Negative lookbehind with multiple whitespaces [duplicate]

I am trying to use lookbehinds in a regular expression and it doesn't seem to work as I expected. So, this is not my real usage, but to simplify I will put an example. Imagine I want to match "example" on a string that says "this is an example". So, according to my understanding of lookbehinds this should work:
(?<=this\sis\san\s*?)example
What this should do is find "this is an", then space characters and finally match the word "example". Now, it doesn't work and I don't understand why, is it impossible to use '+' or '*' inside lookbehinds?
I also tried those two and they work correctly, but don't fulfill my needs:
(?<=this\sis\san\s)example
this\sis\san\s*?example
I am using this site to test my regular expressions: http://gskinner.com/RegExr/

Many regular expression libraries only allow restricted expressions in look-behind assertions, for example:
only alternatives of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
only alternatives of fixed lengths: (?<=foobar|\r\n) (each branch has a fixed length)
only expressions with an upper bound on length: (?<=\s{,4}) (up to four repetitions)
These limitations exist mainly because those libraries can’t process regular expressions backwards at all, or can do so only for a limited subset.
Another reason could be to keep authors from building overly complex regular expressions that are expensive to process because of so-called pathological behavior (see also ReDoS).
See also section about limitations of look-behind assertions on Regular-Expressions.info.
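Python's built-in re module belongs to the strictest group: every alternative inside a look-behind must have the same width. A small sketch to see which look-behinds it accepts (the helper name compiles is made up for this example):

```python
import re

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

same_length = compiles(r"(?<=foo|bar|\s,\s)x")           # every branch is 3 characters
mixed_length = compiles(r"(?<=foobar|\r\n)x")            # 6 vs. 2 characters
lazy_star = compiles(r"(?<=this\sis\san\s*?)example")    # unbounded \s*?
fixed = compiles(r"(?<=this\sis\san\s)example")          # exactly one \s

print(same_length, mixed_length, lazy_star, fixed)  # True False False True
```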

If your engine supports it (PCRE does; Python's re module does not), you can work around the lack of variable-length look-behind with \K, which makes the engine drop everything matched so far and start the reported match over at that point.
This site explains it well: http://www.phpfreaks.com/blog/pcre-regex-spotlight-k
In short, when part of your expression has matched and you only want what comes after it, \K forces the match to start over from there.
Example:
string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'
Matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<\/div)/ makes the regex restart after the > that closes the opening div tag, so nothing before that point is included in the result. The (?=\<\/div) lookahead then makes the engine take everything up to the closing div tag.

What Amber said is true, but you can work around it with another approach: a non-capturing group placed after the look-behind.
(?<=this\sis\san)(?:\s*)example
That makes it a fixed-length look-behind, so it should work.
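One thing to be aware of with this approach: the whitespace is now consumed by the pattern itself, so it ends up in the match. A small check in Python 3 (the sample text is made up):

```python
import re

text = "this is an   example"  # three spaces before "example"

m = re.search(r"(?<=this\sis\san)(?:\s*)example", text)
matched = m.group()
print(repr(matched))  # '   example'

# If only the word is wanted, strip the leading whitespace afterwards.
word = matched.lstrip()
print(repr(word))     # 'example'
```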

You can use sub-expressions.
(this\sis\san\s*?)(example)
To retrieve group 2, "example", use $2 in engines with that replacement syntax, or \2 in a replacement string (as for Python's re.sub).
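In Python this could look like the following sketch (sample text made up). The match object gives direct access to each group, so no look-behind is needed at all:

```python
import re

text = "this is an   example"
pattern = re.compile(r"(this\sis\san\s*?)(example)")

m = pattern.search(text)
word = m.group(2)         # the part we actually care about
position = m.start(2)     # where that part begins in the text
print(word, position)     # example 13

# In a replacement string, \2 refers to the second group.
replaced = pattern.sub(r"\2", text)
print(replaced)           # example
```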

Most regex engines don't support variable-length expressions for lookbehind assertions.

Regular expression pattern questions?

I am having a hard time understanding regular expression patterns. Could someone help me with a pattern that matches all words ending in s, and one that matches words that start with a and end with a (like ana)?
How do I write ending?

Word boundaries are given by \b, so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b", where \b is a word boundary, \w+ is one or more "word characters" and (?:ing|s) is a non-capturing group matching either ing or s.
As you asked "how to develop a regex":
First: don't use regexes for complex tasks. They are hard to read, write and maintain. For example, there is a regex that validates email addresses, but it's computer-generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but at least every alphabetical character. Looking up in the python documentation on regexes you can find \w which is [a-zA-Z0-9_], which fits my impression of a word character. There you can also find \b which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b, which matches a word. We still need to formalize the ..., which should mean "one or more characters"; that translates directly to \b\w+\b. We can now match a word! We still need the s or ing. | translates to "or", so how about \b\w+ing|s\b? If you test this, you'll see that it matches confusing things, like the king in kingdom or the lone final s of words, which is not what we want. What is happening? As you have probably guessed, the | can't know which parts it is supposed to "or" together, so it splits the whole pattern into \b\w+ing and s\b. We need to introduce parentheses: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
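To see the precedence problem in practice, here is a small Python 3 sketch with a made-up sample string. The grouped pattern uses (?:...) rather than (...) only so that re.findall returns whole matches.

```python
import re

text = "kingdom words"

# Without parentheses the pattern means: (\b\w+ing) OR (s\b)
loose = re.findall(r"\b\w+ing|s\b", text)
print(loose)    # ['king', 's']

# With a group, the alternation only covers the ending.
grouped = re.findall(r"\b\w+(?:ing|s)\b", text)
print(grouped)  # ['words']
```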
Why (and how) does this differ from the example I gave first? First, I wrote \w+? instead of \w+; the ? turns the + into its non-greedy version. If you know the difference between greedy and non-greedy, skip this paragraph. Consider the string AaAAbA, where we want to match the things enclosed by a capital A. A naive try is A\w+A: one or more word characters enclosed by A. This can match AaA, but it actually matches all of AaAAbA, because A is itself something \w can match. By default the quantifiers *, + and ? all try to match as much as possible. Sometimes, as in the A example, you don't want that; you can then put a ? after the quantifier to ask for the non-greedy version, which matches as little as possible.
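The A example can be tried directly; a minimal sketch in Python 3:

```python
import re

text = "AaAAbA"

# Greedy: \w+ grabs as much as it can, then gives back just enough for the final A.
greedy = re.findall(r"A\w+A", text)
print(greedy)  # ['AaAAbA']

# Non-greedy: \w+? stops at the first A that completes a match.
lazy = re.findall(r"A\w+?A", text)
print(lazy)    # ['AaA', 'AbA']
```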
But in our case this isn't needed: the words are well separated by whitespace, which is not part of \w. So in fact you can just let + be greedy and everything will be all right. If you use . (any character), you often need to be careful not to match too much.
The other difference is using (?:ing|s) instead of (ing|s). What does the ?: do here? It changes a capturing group into a non-capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together but are not interested in their content; you can then declare the group as non-capturing. Otherwise you will get the group (s or ing) but not the actual word!
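With Python's re.findall the difference is easy to observe, since it returns the group contents whenever the pattern contains a capturing group. A minimal sketch:

```python
import re

text = "fishing words"

# Capturing group: findall returns only what the group matched.
capturing = re.findall(r"\b\w+(ing|s)\b", text)
print(capturing)      # ['ing', 's']

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"\b\w+(?:ing|s)\b", text)
print(non_capturing)  # ['fishing', 'words']
```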
So to summarize:
* start small
* add one case after another
* always test with examples
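One caveat: I just tried re.findall("\b\w+(?:ing|s)\b", "fishing words") and it didn't work, while "\w+(?:ing|s)" does. The reason is that in a normal Python string literal \b is a backspace character, not a word boundary; writing the pattern as a raw string, r"\b\w+(?:ing|s)\b", fixes it. Regexes are an arcane thing; only use them for simple tasks that are easy to test.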
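A likely explanation for the \b surprise is string escaping rather than the regex itself: in a normal Python string literal, "\b" is the backspace character. The sketch below builds the non-raw pattern piece by piece so that this is explicit:

```python
import re

text = "fishing words"

# "\b" in a normal string is backspace (0x08), so this pattern looks for
# literal backspace characters around the word.
not_raw = "\b" + r"\w+(?:ing|s)" + "\b"
print("\b" == "\x08")             # True
without_raw = re.findall(not_raw, text)
print(without_raw)                # []

# In a raw string the backslash reaches the regex engine untouched.
with_raw = re.findall(r"\b\w+(?:ing|s)\b", text)
print(with_raw)                   # ['fishing', 'words']
```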

Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

Python Regex: force greedy match using alternation

I have a regex of the form:
a(bc|de|def)g?
On the string adefg this pattern is matching only up to "ade" and it is clearly quitting on the first match in the alternation group. Removing the ? option from the "g" token allows the pattern to match the entire string. This makes sense since the "?" is non-greedy. [EDIT: I have been corrected, the "?" is greedy, which just seems to add to my confusion. It seemed to me that if the "?" were non-greedy, this was allowing the pattern to quit early when a larger match was available.]
I would like to avoid rearranging the order of the strings in the alternation. I can solve the problem as is by appending (\b|$) to the pattern, but now I am really curious to know if there are other solutions.
For instance, is there any way to make the "?" greedy or to force the alternation not to quit on the first match?

You can't make the | not match its constituents left to right, because matching left to right is its documented behavior. Even if you could make the ? "greedy", it wouldn't work, because the regex matches from beginning to end, so the greediness of the ? couldn't have an effect until after the alternation had already matched.
Greediness doesn't make the regex engine go back to find a "better way" to match; it will match the first way it can. It will only make use of the g? if it has to do so in order for the entire match to succeed, and it won't have to if it can just ignore it and stick with what it matched in the alternation. In other words, once it matches "ade", it can succeed and stop (because it doesn't need to match the "g", since it's optional). It therefore doesn't even consider the other parts of the alternation, since it can find a way to make it work using the first one. A greedy ? doesn't make it go back and retry other things it already matched unless it needs to for the entire match to succeed.
If you are using an alternation where some alternants are substrings of others, you should put them in order so the longest ones come first.
Another possibility is to add a $ to the end of your regex. This will force it to go all the way to the end of the string, so it will backtrack and try the other alternatives, because now "ade" won't be a match (since it doesn't match the $). However, this will only work if you really do want to match the whole string.
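The behaviour described above can be checked with a short Python 3 sketch using the pattern and string from the question:

```python
import re

text = "adefg"

# Original order: "de" succeeds first and the optional g is simply skipped.
original = re.match(r"a(bc|de|def)g?", text).group()
print(original)   # ade

# Longest alternative first.
reordered = re.match(r"a(bc|def|de)g?", text).group()
print(reordered)  # adefg

# Anchoring at the end forces backtracking into the alternation.
anchored = re.match(r"a(bc|de|def)g?$", text).group()
print(anchored)   # adefg
```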

You can usually use a negative lookahead, but I don't know the capabilities of Python's regex engine.
a(bc|de(?!f)|def)g?
check here
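For what it's worth, Python's re module does support negative lookahead, so this pattern can be tried as is. A minimal sketch:

```python
import re

pattern = re.compile(r"a(bc|de(?!f)|def)g?")

# "de" is rejected when an f follows, so the engine moves on to "def".
full = pattern.match("adefg").group()
print(full)   # adefg

# Strings without the f still match through the "de" branch.
short = pattern.match("adeg").group()
print(short)  # adeg
```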

An obvious way to refactor this expression would be to "unroll" the optional part:
a(bc|de|def)g|a(bc|de|def)
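or, in engines that support subroutine calls (PCRE, or the third-party regex package for Python),
(a(bc|de|def))g|(?1)
to avoid the repetition. Note that a backreference such as \1 would not do here: it matches the text group 1 captured, not its pattern, and group 1 captures nothing when the first alternative fails.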
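A quick check of the unrolled pattern with Python's re module, plus a look at what happens when a backreference is used in place of the repeated group (in re, a backreference to a group that did not take part in the match fails):

```python
import re

unrolled = re.compile(r"a(bc|de|def)g|a(bc|de|def)")
with_g = unrolled.match("adefg").group()
without_g = unrolled.match("ade").group()
print(with_g, without_g)  # adefg ade

# \1 repeats the captured text, not the sub-pattern, so the second
# alternative can never succeed on its own.
backref = re.compile(r"(a(bc|de|def))g|\1")
backref_result = backref.match("ade")
print(backref_result)     # None
```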

multiple negative lookahead assertions

I can't figure out how to do multiple lookaround for the life of me. Say I want to match a variable number of numbers following a hash but not if preceded by something or followed by something else. For example I want to match #123 or #12345 in the following. The lookbehinds seem to be fine but the lookaheads do not. I'm out of ideas.
import re

matches = ["#123", "This is #12345",
           # But not
           "bad #123", "No match #12345", "This is #123-ubuntu",
           "This is #123 0x08"]
pat = '(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
for i in matches:
    # Show each test string next to its match object (or None)
    print(i, re.search(pat, i))

You should have a look at the captures as well. I bet for the last two strings you will get:
#12
This is what happens:
The engine checks the two lookbehinds - they don't match, so it continues with the capturing group #[0-9]+ and matches #123. Now it checks the lookaheads. They fail as desired. But now there's backtracking! There is one variable in the pattern and that is the +. So the engine discards the last matched character (3) and tries again. Now the lookaheads are no problem any more and you get a match. The simplest way to solve this is to add another lookahead that makes sure that you go to the last digit:
pat = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'
Note the use of a raw string (the leading r) - it doesn't matter in this pattern, but it's generally a good practice, because things get ugly once you start escaping characters.
EDIT: If you are using or willing to use the regex package instead of re, you get possessive quantifiers which suppress backtracking:
pat = r'(?<!bad )(?<!No match )(#[0-9]++)(?! 0x0)(?!-ubuntu)'
It's up to you which you find more readable or maintainable. The latter will be marginally more efficient, though. (Credits go to nhahtdh for pointing me to the regex package.)
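To make the backtracking visible, the sketch below runs the original pattern and the (?![0-9]) fix over the test strings from the question. The helper name first_capture is made up for this example.

```python
import re

tests = ["#123", "This is #12345", "bad #123", "No match #12345",
         "This is #123-ubuntu", "This is #123 0x08"]

original = r'(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
fixed = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'

def first_capture(pat, s):
    """Return the text of group 1, or None if there is no match."""
    m = re.search(pat, s)
    return m.group(1) if m else None

before = [first_capture(original, s) for s in tests]
after = [first_capture(fixed, s) for s in tests]
print(before)  # ['#123', '#12345', None, None, '#12', '#12']
print(after)   # ['#123', '#12345', None, None, None, None]
```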

Python re: negating part of a regular expression

Perhaps a silly question, but though Google returned lots of similar cases, I could not find this exact situation: what regular expression will match all strings NOT containing a particular string? For example, I want to match any string that does not contain 'foo_'.
Now,
re.match('(?<!foo_).*', 'foo_bar')
returns a match. While
re.match('(?<!foo_)bar', 'foo_bar')
does not.
I tried the non-greedy version:
re.match('(?<!foo_).*?', 'foo_bar')
still returns a match.
If I add more characters after the ),
re.search('(?<!foo_)b.*', 'foo_bar')
it returns None, but if the target string has more trailing chars:
re.search('(?<!foo_)b.*', 'foo_barbaric')
it returns a match.
I intentionally kept out the initial .* or .*? in the re. But same thing happens with that.
Any ideas why I am seeing this strange behaviour? (I need this as a single regular expression, to be entered as user input.)

You're using lookbehind assertions where you need lookahead assertions:
re.match(r"(?!.*foo_).*", "foo_bar")
would work (i.e. not match).
(?!.*foo_) means "Assert that it is impossible to match .*foo_ from the current position in the string." Since you're using re.match(), that position is automatically defined as the start of the string.

Try this pattern instead:
^(?!.*foo_).*
This uses the ^ metacharacter to match from the beginning of the string, and then uses a negative look-ahead that checks for "foo_". If it exists, the match will fail.
Since you gave examples using both re.match() and re.search(), the above pattern would work with both approaches. However, when you're using re.match() you can safely omit the usage of the ^ metacharacter since it will match at the beginning of the string, unlike re.search() which matches anywhere in the string.
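A minimal Python 3 sketch of this pattern, together with the look-behind attempts from the question for comparison (the sample strings beyond those in the question are made up):

```python
import re

no_foo = re.compile(r"^(?!.*foo_).*")

samples = ["foo_bar", "my_foo_bar", "bar", "barbaric"]
accepted = [s for s in samples if no_foo.search(s)]
print(accepted)  # ['bar', 'barbaric']

# The look-behind version only checks what precedes the current position.
# At position 0 nothing precedes it, so the assertion holds trivially.
at_start = re.match(r"(?<!foo_).*", "foo_bar").group()
print(at_start)  # foo_bar

# With search, the engine just moves on to a later b that is not
# directly preceded by foo_.
later_b = re.search(r"(?<!foo_)b.*", "foo_barbaric").group()
print(later_b)   # baric
```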

I feel like there is a good chance that you could just design around this with a conditional statement.
(It would be nice if we knew specifically what you're trying to accomplish).
Why not:
# re.search looks anywhere in the string; re.match would only check the start
if not re.search("foo_", something):
    do_something()
else:
    print("Skipping this")
