Why does this take so long to match? Is it a bug? - python

I need to match certain URLs in web application, i.e. /123,456,789, and wrote this regex to match the pattern:
r'(\d+(,)?)+/$'
I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')
The expected result would be that there were no matches.
This expression, however, executes almost immediately (note the trailing slash):
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')
Is this a bug?

There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.
The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.
For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.
This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.

First off, I must say that this is not a BUG. What your regex is doing is that it's trying all the possibilities due to the nested repeating patters. Sometimes this process can gobble up a lot of time and resources and when it gets really bad, it’s called catastrophic backtracking.
This is the code of findall function in python source code:
def findall(pattern, string, flags=0):
"""Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result."""
return _compile(pattern, flags).findall(string)
As you see it just use the compile() function, so based on _compile() function that actually use Traditional NFA that python use for its regex matching, and base on
this short explain about backtracking in regular expression in Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!
The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options,
it selects one and remembers the other to return to later if need be.
Situations where it has to decide among courses of action include anything with a
quantifier (decide whether to try another match), and alternation (decide which
alter native to try, and which to leave for later).
Whichever course of action is attempted, if it’s successful and the rest of the regex
is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the
first option, and can continue with the match by trying the other option. This way,
it eventually tries all possible permutations of the regex (or at least as many as
needed until a match is found).
Let's go inside your pattern: So you have r'(\d+(,)?)+/$' with this string '12345121,223456,123123,3234,4523,523523' we have this steps:
At first, the first part of your string (12345121) is matched with \d+, then , is matched with (,)? .
Then based on first step, the whole string is match due to + after the grouping ((\d+(,)?)+)
Then at the end, there is nothing for /$ to be matched. Therefore, (\d+(,)?)+ needs to "backtrack" to one character before the last for check for /$. Again, it don't find any proper match, so after that it's (,)'s turn to backtrack, then \d+ will backtrack, and this backtracking will be continue to end till it return None.
So based on the length of your string it takes time, which in this case is very high, and it create a nested quantifiers entirely!
As an approximately benchmarking, in this case, you have 39 character so you need 3^39 backtracking attempts (we have 3 methods for backtrack).
Now for better understanding, I measure the runtime of the program while changing the length of the string:
'12345121,223456,123123,3234,4523,' 3^33 = 5.559060567×10¹⁵
~/Desktop $ time python ex.py
real 0m3.814s
user 0m3.818s
sys 0m0.000s
'12345121,223456,123123,3234,4523,5' 3^24 = 1.66771817×10¹⁶ #X2 before
~/Desktop $ time python ex.py
real 0m5.846s
user 0m5.837s
sys 0m0.015s
'12345121,223456,123123,3234,4523,523' 3^36= 1.500946353×10¹⁷ #~10X before
~/Desktop $ time python ex.py
real 0m15.796s
user 0m15.803s
sys 0m0.008s
So to avoid this problem you can use one of the below ways:
Atomic grouping (Currently doesn't support in Python, A RFE was created to add it to Python 3)
Reduction the possibility of backtracking by breaking the nested groups to separate regexes.

To avoid the catastrophic backtracking I suggest
r'\d+(,\d+)*/$'

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the
:
as mentioned above?
I apologize is this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matched 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside, it's likely that you are misusing regex.match and regex.find. match will only return a match that begins at the start of the string, while find will return matches anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to find.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matched several lines of text as one match, and hiding multiple records in a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture it's match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.

Understanding regex pattern used to find string between strings in html

I have the following html file:
<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">
In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:
import re
Source_file = open('source.html').read()
result = re.compile('videos/(.*?)/"').search(Source_file)
print result
I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?
The ? in this context is a special operator on the repetition operators (+, *, and ?). In engines where it is available this causes the repetition to be lazy or non-greedy or reluctant or other such terms. Typically repetition is greedy which means that it should match as much as possible. So you have three types of repetition in most modern perl-compatible engines:
.* # Match any character zero or more times
.*? # Match any character zero or more times until the next match (reluctant)
.*+ # Match any character zero or more times and don't stop matching! (possessive)
More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).
Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is
match one or more a's followed by an a.
This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.
If we use the regex /(a+?)a this is
reluctantly match one or more as followed by an a
or
match one or more as until we reach another a
That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
This comes up a lot when using regex to match within html tags, quotes and the suchlike -- usually reserved for quick and dirty operations. That is to say using regex to extract from very large and complex html strings or quoted strings with escape sequence can cause a lot of problems but it's perfectly fine for specific use cases. So in your case we have:
/Dev/videos/1610110089242029/
The expression needs to match videos/ followed by zero or more characters followed by /". If there is only one videos URL there that's just fine without being reluctant.
However we have
/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"
Without reluctance, the regex will match:
1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029
It tries to match as much as possible and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks but you can read about that separately). Thus you only get the part of the url you need.
It can be explained in a simple way:
.: match anything (any character),
*: any number of times (at least zero times),
?: as few times as possible (hence non-greedy).
videos/(.*?)/"
as a regular expression matches (for example)
videos/1610110089242029/"
and the first capturing group returns 1610110089242029, because any of the digits is part of “any character” and there are at least zero characters in it.
The ? causes something like this:
videos/1610110089242029/" something else … "videos/2387423470237509/"
to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.
The . means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy; that means that it will try to capture as few characters as possible, i.e., if the regex encounters a /, it could match it with the ., but it would rather not because the . is non-greedy, and since the next character in the regex is happy to match /, the . doesn't have to. If you didn't have the ?, that . would eat up the whole rest of the file because it would be chomping at the bit to match as many things as possible, and since it matches everything, it would go on forever.

Regex taking forever on short string.

I'm looking through a bunch of strings, trying to match some with the following pattern.
location_pattern = re.compile( r"""
\b
(?P<location>
([A-Z]\w*[ -]*)+[, ]+
(
[A-Z]{2}
|
[A-Z]\w+\ *\d ##
)
)
\b
""", flags=re.VERBOSE)
Now this regex runs ok on almost all my data set, but it takes forever (well, 5 seconds) on this particular string:
' JAVASCRIPT SOFTWARE ARCHITECT, SUCCESSFUL SERIAL'
There are a bunch of strings like this one (all caps, lots of space chars) at some point in my input data and the program is greatly slowed when it hits it. I tried taking out different parts of the regex, and it turns out the culprit is the
\ *\d at the end of the commented line.
I'd like to understand how this is causing the regex validation to take so long.
Can anyone help?
The reason why removing \ *\d works is because you just turn the example in the question from a non-matching case into a matching case. In a backtracking engine, a matching case usually takes (much) less time than a non-matching case, since in non-matching case, the engine must exhaust the search space to come to that conclusion.
Where the problem lies is correctly pointed out by Fede (the explanation is rather hand-waving and inaccurate, though).
([A-Z]\w*[ -]*)+
Since [ -]* is optional and \w can match [A-Z], the regex above degenerates to ([A-Z][A-Z]*)+, which matches the classic example of catastrophic backtracking (A*)*. The degenerate form also shows that the problem manifests on long strings of uppercase letter.
The fragment by itself doesn't cause much harm. However, as long as the sequel (whatever follows after the fragment above) fails, it will cause catastrophic backtracking.
Here is one way to rewrite it (without knowing your exact requirement):
[A-Z]\w*(?:[ -]+[A-Z]\w*)*[ -]*
By forcing [ -]+ to be at least once, the pattern can no longer match a string of uppercase letter in multiple ways.
Additionally to Greg's answer, you have also this pattern:
([A-Z]\w*[ -]*)+
^----^-^---- Note embedded quantifiers!
Using quantifiers inside a repeated group (even worse 2 quantifiers as you have) usually generates catastrophic backtracking issues. Hence, I'd re think the regex.
Comment: I can offer you another regex if you add more sample data and your expected output by updating my answer later.

Python Regex: force greedy match using alternation

I have a regex of the form:
a(bc|de|def)g?
On the string adefg this pattern is matching only up to "ade" and it is clearly quitting on the first match in the alternation group. Removing the ? option from the "g" token allows the pattern to match the entire string. This makes sense since the "?" is non-greedy. [EDIT: I have been corrected, the "?" is greedy, which just seems to add to my confusion. It seemed to me that if the "?" were non-greedy, this was allowing the pattern to quit early when a larger match was available.]
I would like to avoid rearranging the order of the strings in the alternation, and I can solve the problem as is by appending (\b|$) to the pattern, but now I am really curious to know if there are other solutions
For instance, is there any way to make the "?" greedy or to force the alternation not to quit on the first match?
You can't make the | not match its constituents left to right, because matching left to right is its documented behavior. Even if you could make the ? "greedy", it wouldn't work, because the regex matches from beginning to end, so the greediness of the ? couldn't have an effect until after the alternation had already matched.
Greediness doesn't make the regex engine go back to find a "better way" to match; it will match the first way it can. It will only make use of the g? if it has to do so in order for the entire match to succeed, and it won't have to if it can just ignore it and stick with what it matched in the alternation. In other words, once it matches "ade", it can succeed and stop (because it doesn't need to match the "g", since it's optional). It therefore doesn't even consider the other parts of the alternation, since it can find a way to make it work using the first one. A greedy ? doesn't make it go back and retry other things it already matched unless it needs to for the entire match to succeed.
If you are using an alternation where some alternants are substrings of others, you should put them in order so the longest ones come first.
Another possibility is to add a $ to the end of your regex. This will force it to go all the way to the end of the string, so it will backtrack and try the other alternatives, because now "ade" won't be a match (since it doesn't match the $). However, this will only work if you really do want to match the whole string.
You can usually use a negative lookahead, but I don't know the capabilities of Python's regex engine.
a(bc|de(?!f)|def)g?
check here
An obvious way to refactor this expression would be to "unroll" the optional part:
a(bc|de|def)g|a(bc|de|def)
or
(a(bc|de|def))g|\1
to avoid the repetition.

Is this regex correct for xsd:anyURI

I am implementing a function (in Python) that checks for conformance of the string to xsd:anyURI.
According to Schema Central it only makes sense to check for repeated, consecutive and non-consecutive # characters and % followed by something other than hex characters 0-Ff.
So far, I have something like and it seems to be working:
if uri.search('(%[^0-9A-Fa-f]+)|(#.*#+)')
The second expression for multiple '#' signs may be faulty.
If you are aiming for an exclusion regex according to the Schema Central parser requirement, you are almost there. The first half, excluding percent signs not followed by two hexadecimal digits is best solved using a negative look-ahead assertion; the second half is fine, though you can ditch the last repeat indicator without affecting your results:
(%(?![0-9A-F]{2})|#.*#)
Compile your regex with case independence (i flag) and you are good to go.
Recommended reading: the Python Standard Library’s chapter on Regular Expression Operation Syntax.
I recently had to do this without a negative lookahead, and the following seems to work:
(%.?[^0-9A-Fa-f]|#.*#)

Categories

Resources