Why Python chokes on this regex? - python

This looks like a simple regex, no backreferences, no "any" characters, I'd even dare to say it's parseable by a Thomson DFA and all. It even works, but chokes on very simple non-matches.
{\s*?
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*?,\s*?
(?P<where>(([A-Z0-9_]+)\s*\|?)+?)\s*?,\s*?
(?P<bla>[^\n}]+?)\s*?,\s*?
(?P<bla2>[^\n}]+?)\s*?,\s*?
(?P<bla3>[^\n}]+?)\s*?,\s*?
(?P<bla4>[^\n}]+?)\s*?
}
+ re.MULTILINE | re.VERBOSE
runnable gist here
I'm currently trying this on Python 2.7.8 (but the linked gist fails on py3.4 too; also linux, x86-64, Ubuntu, PCRE statically linked in [at least /proc//maps doesn't show anything interesting).
This parses well:
{ ngx_string("daemon"),
NGX_MAIN_CONF|NGX_DIRECT_CONF|NGX_CONF_FLAG,
ngx_conf_set_flag_slot,
0,
offsetof(ngx_core_conf_t, daemon),
NULL },
And this is where the fun stops:
{ ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_OFF },
{ ngx_string("on"), NGX_HTTP_REQUEST_BODY_FILE_ON },
Also, more data:
by changing the second line to this
(?P<where>(([A-Z0-9_]{1,20})\s*\|?){1,6}?)\s{0,10}?,\s{0,10}?
, it finally completes in reasonable time, but the exponential blowing up is still there, just bearable:
trying { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE
Took 0.033483 s
trying { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_
Took 0.038528 s
trying { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_O
Took 0.044108 s
trying { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_OF
Took 0.053547 s
Also, interestingly a JS-based Python regex (emulator?) parser can eat it like it's breakfast for PCRE champs: https://www.debuggex.com/r/S__vSvp8-LGLuCLQ
Oh, and maybe someone should create the pathological-regex tag :|

The problem is (([A-Z0-9_]+)\s*\|?)+? in your original regex or (([A-Z0-9_]{1,20})\s*\|?){1,6}? in your test regex. The {1,20} and {1,6} only serves to inhibit the exponential backtracking a bit.
Since \s* and \|? are optional, the regex can expand to the classic example of exponential backtracking (([A-Z0-9_]+))+? when the input contains only [A-Z0-9_] without spaces or bar |, but does not matches the rest of the regex.
For example, matching (?P<where>(([A-Z0-9_]+)\s*\|?)+?)\s*?,\s*? against AAAAAAAAAAAAAA (, is missing) would cause the engine to check all possibility of splitting the string up, each token to be matched at different iterations of the outer repetition:
AAAAAAAAAAAAAA
AAAAAAAAAAAAA A
AAAAAAAAAAAA AA
AAAAAAAAAAAA A A
AAAAAAAAAAA AAA
...
A AAAAAAAAAAAAA
A AAAAAAAAAAAA A
A AAAAAAAAAAA AA
...
A A A A A A A A A A A A A A
On a closer look, the rest of your regex also have excessive backtracking problem.
Take (?P<bla2>[^\n}]+?)\s*?,\s*? for example. [^\n}] can match a space (or a tab, or a number of other space characters), and so can \s. This may cause excessive backtracking if your non-matching input contains a long string of spaces. There might also be correctness issue, since , can be matched by [^\n}] also.
On a side note, Python re is an NFA engine without any optimization to mitigate the exponential backtracking problem.
To mitigate the exponential backtracking:
{\s*
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*,\s*
(?P<where>[A-Z0-9_]+(?:\s*\|\s*[A-Z0-9_]+)*)\s*,\s*
(?P<bla>[^\n}]+?)\s*,\s*
(?P<bla2>[^\n}]+?)\s*,\s*
(?P<bla3>[^\n}]+?)\s*,\s*
(?P<bla4>[^\n}]+?)\s*
}
The excessive backtracking and correctness problem with [^\n}] are still not fixed. The , in function call can be wrongly recognized as part of the outer block {} if the arguments are not on different lines.
The general solution would require recursive regex, since you can call a function as pass its result as argument, e.g. offsetof(ngx_core_conf_t, daemon), but the recursive regex feature is not available in re package. A less general solution would be to match up to some levels of nested parentheses, for example 1 level of nesting:
(?P<bla>(?:\([^)]*\)|[^,()])+),\s*
And the whole regex is:
{\s*?
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*,\s*
(?P<where>[A-Z0-9_]+(?:\s*\|\s*[A-Z0-9_]+)*)\s*,
(?P<bla>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla2>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla3>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla4>(?:\([^)]*\)|[^,()])+)
}
DEMO
There are 2 caveats:
The <bla*> capturing groups will contain spaces at the end. The regex will be a bit longer if you want to remove spaces, and prevent possible excessive backtracking. You can try adding \s* before the , back into this demo here.
It assumes that () are not part of any string literal.

(?P<where>(([A-Z0-9_]+)\s*\|?)+?)
^ ^
This is where your regex is exploding.Read http://www.regular-expressions.info/catastrophic.html .
On every failure it goes back one step to check if there is a match.This is creating an explosion of steps and possibilities for the regex engine.

Related

Python regex: How to implement a regex expression that checks for matching set of brackets?

I need to implement a python regex to capture two statements:
Any sequence of characters with the restriction that any '{' be matched with '}'
Any sequence of characters except '{', '}', ';'
Case 2 is straightforward: r'[^\{\};]+
For case 1, I'm not sure. Nested brackets are not allowed but "hello world {it is} {me}" should be OK. The closest I have right now is r'.*?\{.*?\}.*?, but it matches "an {apple}" but not "an {apple} boy". How do I correct for this?
These should be OK:
{a} {boy} lives here
{a} boy {lives} here
a {boy} lives here
a boy lives here
These are not OK:
{{a} boy lives here}
a boy {{lives}} here
a boy {lives} here {
a boy lives here }
If your brackets are not nested and not escaped then you can use this regex to validate your input:
^(?:[^{}]*{[^}]*})*[^{}]*$
RegEx Demo
This regex will match an empty string also. If you want to avoid then use (?!$) negative lookahead to disallow empty match:
^(?!$)(?:[^{}]*{[^}]*})*[^{}]*$
RegEx Details:
^: Start
(?:: Start a non-capture group
[^{}]*: Match 0 or more of any characters that is not { and }
{[^}]*}: Match a {...} substring
)*: End non-capture group. * lets this group repeat 0 or more times.
[^{}]*: Match 0 or more of any characters that is not { and }
$: ENd
Edit: the question as posed before the edition looked like balanced parenthesis problem - which is not solvable with regular expressions. See explanation below.
But after the edit, it turned out that it's about working with only a single level of parenthesis, i.e. without nesting them. This is possible, and can be seen in anubhava's answer.
You can't.
Regular expressions (as defined by computer science) are unable to perform such task, as they, essentially, lack the "memory" required to do it.
Think of regular expression as a state machine with a finite amount of states. Each character you see in the input moves you from one state to the other.
As soon as you provide a long enough string of open parenthesis, i.e. of length larger than the number of states, you would have to land in a state that you already visited, leaving out any count of open parenthesis you've encountered.
The model of computation you are looking for here would be (at least) something called Context Free Grammar. A machine model that works with such grammars is called Push Down Automata (by analogy, in case of regular expressions it was Finite Automata).
Or maybe you can?
There are some flavours of Regex that doesn't follow the computer science term, and have additional features like recursive expressions. That would be able to capture the parenthesis and find whether all of them close in the right order.
An example of that can be seen here.

REGEX: Negative lookbehind with multiple whitespaces [duplicate]

I am trying to use lookbehinds in a regular expression and it doesn't seem to work as I expected. So, this is not my real usage, but to simplify I will put an example. Imagine I want to match "example" on a string that says "this is an example". So, according to my understanding of lookbehinds this should work:
(?<=this\sis\san\s*?)example
What this should do is find "this is an", then space characters and finally match the word "example". Now, it doesn't work and I don't understand why, is it impossible to use '+' or '*' inside lookbehinds?
I also tried those two and they work correctly, but don't fulfill my needs:
(?<=this\sis\san\s)example
this\sis\san\s*?example
I am using this site to test my regular expressions: http://gskinner.com/RegExr/
Many regular expression libraries do only allow strict expressions to be used in look behind assertions like:
only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with fixed length)
only match strings with a upper bound length: (?<=\s{,4}) (up to four repetitions)
The reason for these limitations are mainly because those libraries can’t process regular expressions backwards at all or only a limited subset.
Another reason could be to avoid authors to build too complex regular expressions that are heavy to process as they have a so called pathological behavior (see also ReDoS).
See also section about limitations of look-behind assertions on Regular-Expressions.info.
Hey if your not using python variable look behind assertion you can trick the regex engine by escaping the match and starting over by using \K.
This site explains it well .. http://www.phpfreaks.com/blog/pcre-regex-spotlight-k ..
But pretty much when you have an expression that you match and you want to get everything behind it using \K will force it to start over again...
Example:
string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'
matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<div)/ will cause the regex to restart after you match the ending div tag so the regex won't include that in the result. The (?=\div) will make the engine get everything in front of ending div tag
What Amber said is true, but you can work around it with another approach: A non-capturing parentheses group
(?<=this\sis\san)(?:\s*)example
That make it a fixed length look behind, so it should work.
You can use sub-expressions.
(this\sis\san\s*?)(example)
So to retrieve group 2, "example", $2 for regex, or \2 if you're using a format string (like for python's re.sub)
Most regex engines don't support variable-length expressions for lookbehind assertions.

Regex taking forever on short string.

I'm looking through a bunch of strings, trying to match some with the following pattern.
location_pattern = re.compile( r"""
\b
(?P<location>
([A-Z]\w*[ -]*)+[, ]+
(
[A-Z]{2}
|
[A-Z]\w+\ *\d ##
)
)
\b
""", flags=re.VERBOSE)
Now this regex runs ok on almost all my data set, but it takes forever (well, 5 seconds) on this particular string:
' JAVASCRIPT SOFTWARE ARCHITECT, SUCCESSFUL SERIAL'
There are a bunch of strings like this one (all caps, lots of space chars) at some point in my input data and the program is greatly slowed when it hits it. I tried taking out different parts of the regex, and it turns out the culprit is the
\ *\d at the end of the commented line.
I'd like to understand how this is causing the regex validation to take so long.
Can anyone help?
The reason why removing \ *\d works is because you just turn the example in the question from a non-matching case into a matching case. In a backtracking engine, a matching case usually takes (much) less time than a non-matching case, since in non-matching case, the engine must exhaust the search space to come to that conclusion.
Where the problem lies is correctly pointed out by Fede (the explanation is rather hand-waving and inaccurate, though).
([A-Z]\w*[ -]*)+
Since [ -]* is optional and \w can match [A-Z], the regex above degenerates to ([A-Z][A-Z]*)+, which matches the classic example of catastrophic backtracking (A*)*. The degenerate form also shows that the problem manifests on long strings of uppercase letter.
The fragment by itself doesn't cause much harm. However, as long as the sequel (whatever follows after the fragment above) fails, it will cause catastrophic backtracking.
Here is one way to rewrite it (without knowing your exact requirement):
[A-Z]\w*(?:[ -]+[A-Z]\w*)*[ -]*
By forcing [ -]+ to be at least once, the pattern can no longer match a string of uppercase letter in multiple ways.
Additionally to Greg's answer, you have also this pattern:
([A-Z]\w*[ -]*)+
^----^-^---- Note embedded quantifiers!
Using quantifiers inside a repeated group (even worse 2 quantifiers as you have) usually generates catastrophic backtracking issues. Hence, I'd re think the regex.
Comment: I can offer you another regex if you add more sample data and your expected output by updating my answer later.

Why does this take so long to match? Is it a bug?

I need to match certain URLs in web application, i.e. /123,456,789, and wrote this regex to match the pattern:
r'(\d+(,)?)+/$'
I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')
The expected result would be that there were no matches.
This expression, however, executes almost immediately (note the trailing slash):
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')
Is this a bug?
There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.
The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.
For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.
This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.
First off, I must say that this is not a BUG. What your regex is doing is that it's trying all the possibilities due to the nested repeating patters. Sometimes this process can gobble up a lot of time and resources and when it gets really bad, it’s called catastrophic backtracking.
This is the code of findall function in python source code:
def findall(pattern, string, flags=0):
"""Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result."""
return _compile(pattern, flags).findall(string)
As you see it just use the compile() function, so based on _compile() function that actually use Traditional NFA that python use for its regex matching, and base on
this short explain about backtracking in regular expression in Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!
The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options,
it selects one and remembers the other to return to later if need be.
Situations where it has to decide among courses of action include anything with a
quantifier (decide whether to try another match), and alternation (decide which
alter native to try, and which to leave for later).
Whichever course of action is attempted, if it’s successful and the rest of the regex
is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the
first option, and can continue with the match by trying the other option. This way,
it eventually tries all possible permutations of the regex (or at least as many as
needed until a match is found).
Let's go inside your pattern: So you have r'(\d+(,)?)+/$' with this string '12345121,223456,123123,3234,4523,523523' we have this steps:
At first, the first part of your string (12345121) is matched with \d+, then , is matched with (,)? .
Then based on first step, the whole string is match due to + after the grouping ((\d+(,)?)+)
Then at the end, there is nothing for /$ to be matched. Therefore, (\d+(,)?)+ needs to "backtrack" to one character before the last for check for /$. Again, it don't find any proper match, so after that it's (,)'s turn to backtrack, then \d+ will backtrack, and this backtracking will be continue to end till it return None.
So based on the length of your string it takes time, which in this case is very high, and it create a nested quantifiers entirely!
As an approximately benchmarking, in this case, you have 39 character so you need 3^39 backtracking attempts (we have 3 methods for backtrack).
Now for better understanding, I measure the runtime of the program while changing the length of the string:
'12345121,223456,123123,3234,4523,' 3^33 = 5.559060567×10¹⁵
~/Desktop $ time python ex.py
real 0m3.814s
user 0m3.818s
sys 0m0.000s
'12345121,223456,123123,3234,4523,5' 3^24 = 1.66771817×10¹⁶ #X2 before
~/Desktop $ time python ex.py
real 0m5.846s
user 0m5.837s
sys 0m0.015s
'12345121,223456,123123,3234,4523,523' 3^36= 1.500946353×10¹⁷ #~10X before
~/Desktop $ time python ex.py
real 0m15.796s
user 0m15.803s
sys 0m0.008s
So to avoid this problem you can use one of the below ways:
Atomic grouping (Currently doesn't support in Python, A RFE was created to add it to Python 3)
Reduction the possibility of backtracking by breaking the nested groups to separate regexes.
To avoid the catastrophic backtracking I suggest
r'\d+(,\d+)*/$'

heavy regex - really time consuming

I have the following regex to detect start and end script tags in the html file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>
it works but needs really long time to detect <script>,
even minutes or hours for long strings
The lite version works perfectly even for long string:
<script[^<]*>[^<]*</script>
however, the extended pattern I use as well for other tags like <a> where < and > are possible to appears also as values of attributes.
python test:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:^s]))*)</script>', re.I + re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
how can I fix it?
The inner part of regex (after <script>) should be changed and simplified.
PS :) Anticipate your answers about the wrong approach like using regex in html parsing,
I know very well many html/xml parsers, and what I can expect in often broken html code, and regex is really useful here.
comment:
well, I need to handle:
each <a < document like this.border="5px;">
and approach is to use parsers and regex together
BeautifulSoup is only 2k lines, which not handling every html and just extends regex from sgmllib.
and the main reason is that I must know exact the position where every tag starts and stop. and every broken html must be handled.
BS is not perfect, sometimes happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
#Cylian:
atomic grouping as you know is not available in python's re.
so non-geedy everything .*? until <\s/\stag\s*>** is a winner at this time.
I know that is not perfect in that case:
re.search('<\sscript.?<\s*/\sscript\s>','< script </script> shit </script>').group()
but I can handle refused tail in the next parsing.
It's pretty obvious that html parsing with regex is not one battle figthing.
Use an HTML parser like beautifulsoup.
See the great answers for "Can I remove script tags with beautifulsoup?".
If your only tool is a hammer, every problem starts looking like a nail. Regular expressions are a powerful hammer but not always the best solution for some problems.
I guess you want to remove scripts from HTML posted by users for security reasons. If security is the main concern, regular expressions are hard to implement because there are so many things a hacker can modify to fool your regex, yet most browsers will happily evaluate... An specialized parser is easier to use, performs better and is safer.
If you are still thinking "why can't I use regex", read this answer pointed by mayhewr's comment. I could not put it better, the guy nailed it, and his 4433 upvotes are well deserved.
I don't know python, but I know regular expressions:
if you use the greedy/non-greedy operators you get a much simpler regex:
<script.*?>.*?</script>
This is assuming there are no nested scripts.
The problem in pattern is that it is backtracking. Using atomic groups this issue could be solved. Change your pattern to this**
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
^^^^^ ^^^^^
Explanation
<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
Match any character that is NOT a “<” «[^<]+?»
Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
Match any character that is NOT a “<” «[^<]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->

Categories

Resources