heavy regex - really time consuming

heavy regex - really time consuming - python

I have the following regex to detect start and end script tags in the html file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>
it works but needs really long time to detect <script>,
even minutes or hours for long strings
The lite version works perfectly even for long string:
<script[^<]*>[^<]*</script>
however, the extended pattern I use as well for other tags like <a> where < and > are possible to appears also as values of attributes.
python test:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:^s]))*)</script>', re.I + re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
how can I fix it?
The inner part of regex (after <script>) should be changed and simplified.
PS :) Anticipate your answers about the wrong approach like using regex in html parsing,
I know very well many html/xml parsers, and what I can expect in often broken html code, and regex is really useful here.
comment:
well, I need to handle:
each <a < document like this.border="5px;">
and approach is to use parsers and regex together
BeautifulSoup is only 2k lines, which not handling every html and just extends regex from sgmllib.
and the main reason is that I must know exact the position where every tag starts and stop. and every broken html must be handled.
BS is not perfect, sometimes happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
#Cylian:
atomic grouping as you know is not available in python's re.
so non-geedy everything .*? until <\s/\stag\s*>** is a winner at this time.
I know that is not perfect in that case:
re.search('<\sscript.?<\s*/\sscript\s>','< script </script> shit </script>').group()
but I can handle refused tail in the next parsing.
It's pretty obvious that html parsing with regex is not one battle figthing.

Use an HTML parser like beautifulsoup.
See the great answers for "Can I remove script tags with beautifulsoup?".
If your only tool is a hammer, every problem starts looking like a nail. Regular expressions are a powerful hammer but not always the best solution for some problems.
I guess you want to remove scripts from HTML posted by users for security reasons. If security is the main concern, regular expressions are hard to implement because there are so many things a hacker can modify to fool your regex, yet most browsers will happily evaluate... An specialized parser is easier to use, performs better and is safer.
If you are still thinking "why can't I use regex", read this answer pointed by mayhewr's comment. I could not put it better, the guy nailed it, and his 4433 upvotes are well deserved.

I don't know python, but I know regular expressions:
if you use the greedy/non-greedy operators you get a much simpler regex:
<script.*?>.*?</script>
This is assuming there are no nested scripts.

The problem in pattern is that it is backtracking. Using atomic groups this issue could be solved. Change your pattern to this**
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
^^^^^ ^^^^^
Explanation
<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
Match any character that is NOT a “<” «[^<]+?»
Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
Match any character that is NOT a “<” «[^<]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the
:
as mentioned above?
I apologize is this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.

Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matched 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside, it's likely that you are misusing regex.match and regex.find. match will only return a match that begins at the start of the string, while find will return matches anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to find.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matched several lines of text as one match, and hiding multiple records in a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture it's match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.

REGEX: Negative lookbehind with multiple whitespaces [duplicate]

I am trying to use lookbehinds in a regular expression and it doesn't seem to work as I expected. So, this is not my real usage, but to simplify I will put an example. Imagine I want to match "example" on a string that says "this is an example". So, according to my understanding of lookbehinds this should work:
(?<=this\sis\san\s*?)example
What this should do is find "this is an", then space characters and finally match the word "example". Now, it doesn't work and I don't understand why, is it impossible to use '+' or '*' inside lookbehinds?
I also tried those two and they work correctly, but don't fulfill my needs:
(?<=this\sis\san\s)example
this\sis\san\s*?example
I am using this site to test my regular expressions: http://gskinner.com/RegExr/

Many regular expression libraries do only allow strict expressions to be used in look behind assertions like:
only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with fixed length)
only match strings with a upper bound length: (?<=\s{,4}) (up to four repetitions)
The reason for these limitations are mainly because those libraries can’t process regular expressions backwards at all or only a limited subset.
Another reason could be to avoid authors to build too complex regular expressions that are heavy to process as they have a so called pathological behavior (see also ReDoS).
See also section about limitations of look-behind assertions on Regular-Expressions.info.

Hey if your not using python variable look behind assertion you can trick the regex engine by escaping the match and starting over by using \K.
This site explains it well .. http://www.phpfreaks.com/blog/pcre-regex-spotlight-k ..
But pretty much when you have an expression that you match and you want to get everything behind it using \K will force it to start over again...
Example:
string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'
matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<div)/ will cause the regex to restart after you match the ending div tag so the regex won't include that in the result. The (?=\div) will make the engine get everything in front of ending div tag

What Amber said is true, but you can work around it with another approach: A non-capturing parentheses group
(?<=this\sis\san)(?:\s*)example
That make it a fixed length look behind, so it should work.

You can use sub-expressions.
(this\sis\san\s*?)(example)
So to retrieve group 2, "example", $2 for regex, or \2 if you're using a format string (like for python's re.sub)

Most regex engines don't support variable-length expressions for lookbehind assertions.

Understanding regex pattern used to find string between strings in html

I have the following html file:
<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">
In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:
import re
Source_file = open('source.html').read()
result = re.compile('videos/(.*?)/"').search(Source_file)
print result
I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?

The ? in this context is a special operator on the repetition operators (+, *, and ?). In engines where it is available this causes the repetition to be lazy or non-greedy or reluctant or other such terms. Typically repetition is greedy which means that it should match as much as possible. So you have three types of repetition in most modern perl-compatible engines:
.* # Match any character zero or more times
.*? # Match any character zero or more times until the next match (reluctant)
.*+ # Match any character zero or more times and don't stop matching! (possessive)
More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).
Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is
match one or more a's followed by an a.
This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.
If we use the regex /(a+?)a this is
reluctantly match one or more as followed by an a
or
match one or more as until we reach another a
That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
This comes up a lot when using regex to match within html tags, quotes and the suchlike -- usually reserved for quick and dirty operations. That is to say using regex to extract from very large and complex html strings or quoted strings with escape sequence can cause a lot of problems but it's perfectly fine for specific use cases. So in your case we have:
/Dev/videos/1610110089242029/
The expression needs to match videos/ followed by zero or more characters followed by /". If there is only one videos URL there that's just fine without being reluctant.
However we have
/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"
Without reluctance, the regex will match:
1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029
It tries to match as much as possible and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks but you can read about that separately). Thus you only get the part of the url you need.

It can be explained in a simple way:
.: match anything (any character),
*: any number of times (at least zero times),
?: as few times as possible (hence non-greedy).
videos/(.*?)/"
as a regular expression matches (for example)
videos/1610110089242029/"
and the first capturing group returns 1610110089242029, because any of the digits is part of “any character” and there are at least zero characters in it.
The ? causes something like this:
videos/1610110089242029/" something else … "videos/2387423470237509/"
to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.

The . means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy; that means that it will try to capture as few characters as possible, i.e., if the regex encounters a /, it could match it with the ., but it would rather not because the . is non-greedy, and since the next character in the regex is happy to match /, the . doesn't have to. If you didn't have the ?, that . would eat up the whole rest of the file because it would be chomping at the bit to match as many things as possible, and since it matches everything, it would go on forever.

Regex taking forever on short string.

I'm looking through a bunch of strings, trying to match some with the following pattern.
location_pattern = re.compile( r"""
\b
(?P<location>
([A-Z]\w*[ -]*)+[, ]+
(
[A-Z]{2}
|
[A-Z]\w+\ *\d ##
)
)
\b
""", flags=re.VERBOSE)
Now this regex runs ok on almost all my data set, but it takes forever (well, 5 seconds) on this particular string:
' JAVASCRIPT SOFTWARE ARCHITECT, SUCCESSFUL SERIAL'
There are a bunch of strings like this one (all caps, lots of space chars) at some point in my input data and the program is greatly slowed when it hits it. I tried taking out different parts of the regex, and it turns out the culprit is the
\ *\d at the end of the commented line.
I'd like to understand how this is causing the regex validation to take so long.
Can anyone help?

The reason why removing \ *\d works is because you just turn the example in the question from a non-matching case into a matching case. In a backtracking engine, a matching case usually takes (much) less time than a non-matching case, since in non-matching case, the engine must exhaust the search space to come to that conclusion.
Where the problem lies is correctly pointed out by Fede (the explanation is rather hand-waving and inaccurate, though).
([A-Z]\w*[ -]*)+
Since [ -]* is optional and \w can match [A-Z], the regex above degenerates to ([A-Z][A-Z]*)+, which matches the classic example of catastrophic backtracking (A*)*. The degenerate form also shows that the problem manifests on long strings of uppercase letter.
The fragment by itself doesn't cause much harm. However, as long as the sequel (whatever follows after the fragment above) fails, it will cause catastrophic backtracking.
Here is one way to rewrite it (without knowing your exact requirement):
[A-Z]\w*(?:[ -]+[A-Z]\w*)*[ -]*
By forcing [ -]+ to be at least once, the pattern can no longer match a string of uppercase letter in multiple ways.

Additionally to Greg's answer, you have also this pattern:
([A-Z]\w*[ -]*)+
^----^-^---- Note embedded quantifiers!
Using quantifiers inside a repeated group (even worse 2 quantifiers as you have) usually generates catastrophic backtracking issues. Hence, I'd re think the regex.
Comment: I can offer you another regex if you add more sample data and your expected output by updating my answer later.

Why does this take so long to match? Is it a bug?

I need to match certain URLs in web application, i.e. /123,456,789, and wrote this regex to match the pattern:
r'(\d+(,)?)+/$'
I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')
The expected result would be that there were no matches.
This expression, however, executes almost immediately (note the trailing slash):
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')
Is this a bug?

There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.
The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.
For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.
This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.

First off, I must say that this is not a BUG. What your regex is doing is that it's trying all the possibilities due to the nested repeating patters. Sometimes this process can gobble up a lot of time and resources and when it gets really bad, it’s called catastrophic backtracking.
This is the code of findall function in python source code:
def findall(pattern, string, flags=0):
"""Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result."""
return _compile(pattern, flags).findall(string)
As you see it just use the compile() function, so based on _compile() function that actually use Traditional NFA that python use for its regex matching, and base on
this short explain about backtracking in regular expression in Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!
The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options,
it selects one and remembers the other to return to later if need be.
Situations where it has to decide among courses of action include anything with a
quantifier (decide whether to try another match), and alternation (decide which
alter native to try, and which to leave for later).
Whichever course of action is attempted, if it’s successful and the rest of the regex
is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the
first option, and can continue with the match by trying the other option. This way,
it eventually tries all possible permutations of the regex (or at least as many as
needed until a match is found).
Let's go inside your pattern: So you have r'(\d+(,)?)+/$' with this string '12345121,223456,123123,3234,4523,523523' we have this steps:
At first, the first part of your string (12345121) is matched with \d+, then , is matched with (,)? .
Then based on first step, the whole string is match due to + after the grouping ((\d+(,)?)+)
Then at the end, there is nothing for /$ to be matched. Therefore, (\d+(,)?)+ needs to "backtrack" to one character before the last for check for /$. Again, it don't find any proper match, so after that it's (,)'s turn to backtrack, then \d+ will backtrack, and this backtracking will be continue to end till it return None.
So based on the length of your string it takes time, which in this case is very high, and it create a nested quantifiers entirely!
As an approximately benchmarking, in this case, you have 39 character so you need 3^39 backtracking attempts (we have 3 methods for backtrack).
Now for better understanding, I measure the runtime of the program while changing the length of the string:
'12345121,223456,123123,3234,4523,' 3^33 = 5.559060567×10¹⁵
~/Desktop $ time python ex.py
real 0m3.814s
user 0m3.818s
sys 0m0.000s
'12345121,223456,123123,3234,4523,5' 3^24 = 1.66771817×10¹⁶ #X2 before
~/Desktop $ time python ex.py
real 0m5.846s
user 0m5.837s
sys 0m0.015s
'12345121,223456,123123,3234,4523,523' 3^36= 1.500946353×10¹⁷ #~10X before
~/Desktop $ time python ex.py
real 0m15.796s
user 0m15.803s
sys 0m0.008s
So to avoid this problem you can use one of the below ways:
Atomic grouping (Currently doesn't support in Python, A RFE was created to add it to Python 3)
Reduction the possibility of backtracking by breaking the nested groups to separate regexes.

To avoid the catastrophic backtracking I suggest
r'\d+(,\d+)*/$'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.