This code snippet intends to search for a regex match on each line of the supplied file. re.search() gets hung at a line containing "#" character 3e+5 times in the file.
What could be the solution to this problem?
import re
print "Started..."
exp = "(.*)\$\$\$Uniqueterm:(.*)"
with open("sample.txt", 'r') as file:
for line in file:
if re.search(exp, line):
print "Found match: " + re.search(exp,line).groups()[1].strip()
print "File finished..."
Sample input file (sample.txt):
abc
pqr
##### (3e+5 times '#' in a single line)
xyz
$$$Uniqueterm: Match it
qaz
Expected Output:
Match it
You're using re.search with a regex that starts with (.*). re.search looks for a match at any starting position, meaning it has to start a search from every possible starting index until it finds a match or runs out of positions to search from. The leading (.*) forces a scan of the entire string starting from the search start position, for every starting position.
It's classic catastrophic backtracking, just with part of the backtracking implicit in the use of re.search instead of built into the regex itself. You could adjust the regex to eliminate the catastrophic backtracking, but why use a regex at all? Basic methods like str.split or str.find can do the job just fine. Jean-François Fabre's answer shows one way to do it.
regex engine can reach high complexties, specially when it must backtrack, like in your case.
So when the expression to search for is long and you have groups that must be computed with a lot of trial & error (i.e. backtracking), the search can take ages (see a famous example of regex failure on StackOverflow network).
On July 20, 2016 we experienced a 34 minute outage starting at 14:44 UTC. It took 10 minutes to identify the cause, 14 minutes to write the code to fix it, and 10 minutes to roll out the fix to a point where Stack Overflow became available again.
The direct cause was a malformed post that caused one of our regular expressions to consume high CPU on our web servers. The post was in the homepage list, and that caused the expensive regular expression to be called on each home page view.
...
This regular expression has been replaced with a substring function.
I'd propose a workaround since you don't really need regex here, using str.split would do and is fast, since it just searches for the substring (O(N) approach) then creates 2 strings, which is equivalent of what you're trying to do with regular expressions:
a = "foo$$$Uniqueterm:bar"
g1,g2 = a.split("$$$Uniqueterm:")
print(g1,g2)
result
foo bar
Related
I need to match certain URLs in web application, i.e. /123,456,789, and wrote this regex to match the pattern:
r'(\d+(,)?)+/$'
I noticed that it does not seem to evaluate, even after several minutes when testing the pattern:
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523')
The expected result would be that there were no matches.
This expression, however, executes almost immediately (note the trailing slash):
re.findall(r'(\d+(,)?)+/$', '12345121,223456,123123,3234,4523,523523/')
Is this a bug?
There is some catastrophic backtracking going on that will cause an exponential amount of processing depending on how long the non-match string is. This has to do with your nested repetitions and optional comma (even though some regex engines can determine that this wouldn't be a match with attempting all of the extraneous repetition). This is solved by optimizing the expression.
The easiest way to accomplish this is to just look for 1+ digits or commas followed by a slash and the end of the string: [\d,]+/$. However, that is not perfect since it would allow for something like ,123,,4,5/.
For this you can use a slightly optimized version of your initial try: (?:\d,?)+/$. First, I made your repeating group non-capturing ((?:...)) which isn't necessary but it provides for a "cleaner match". Next, and the only crucial step, I stopped repeating the \d inside of the group since the group is already repeating. Finally, I removed the unnecessary group around the optional , since ? only affects the last character. Pretty much this will look for one digit, maybe a comma, then repeat, and finally followed by a trailing /.
This can still match an odd string 1,2,3,/, so for the heck of it I improved your original regex with a negative lookbehind: (?:\d,?)+(?<!,)/$. This will assert that there is no comma directly before the trailing /.
First off, I must say that this is not a BUG. What your regex is doing is that it's trying all the possibilities due to the nested repeating patters. Sometimes this process can gobble up a lot of time and resources and when it gets really bad, it’s called catastrophic backtracking.
This is the code of findall function in python source code:
def findall(pattern, string, flags=0):
"""Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result."""
return _compile(pattern, flags).findall(string)
As you see it just use the compile() function, so based on _compile() function that actually use Traditional NFA that python use for its regex matching, and base on
this short explain about backtracking in regular expression in Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!
The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options,
it selects one and remembers the other to return to later if need be.
Situations where it has to decide among courses of action include anything with a
quantifier (decide whether to try another match), and alternation (decide which
alter native to try, and which to leave for later).
Whichever course of action is attempted, if it’s successful and the rest of the regex
is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the
first option, and can continue with the match by trying the other option. This way,
it eventually tries all possible permutations of the regex (or at least as many as
needed until a match is found).
Let's go inside your pattern: So you have r'(\d+(,)?)+/$' with this string '12345121,223456,123123,3234,4523,523523' we have this steps:
At first, the first part of your string (12345121) is matched with \d+, then , is matched with (,)? .
Then based on first step, the whole string is match due to + after the grouping ((\d+(,)?)+)
Then at the end, there is nothing for /$ to be matched. Therefore, (\d+(,)?)+ needs to "backtrack" to one character before the last for check for /$. Again, it don't find any proper match, so after that it's (,)'s turn to backtrack, then \d+ will backtrack, and this backtracking will be continue to end till it return None.
So based on the length of your string it takes time, which in this case is very high, and it create a nested quantifiers entirely!
As an approximately benchmarking, in this case, you have 39 character so you need 3^39 backtracking attempts (we have 3 methods for backtrack).
Now for better understanding, I measure the runtime of the program while changing the length of the string:
'12345121,223456,123123,3234,4523,' 3^33 = 5.559060567×10¹⁵
~/Desktop $ time python ex.py
real 0m3.814s
user 0m3.818s
sys 0m0.000s
'12345121,223456,123123,3234,4523,5' 3^24 = 1.66771817×10¹⁶ #X2 before
~/Desktop $ time python ex.py
real 0m5.846s
user 0m5.837s
sys 0m0.015s
'12345121,223456,123123,3234,4523,523' 3^36= 1.500946353×10¹⁷ #~10X before
~/Desktop $ time python ex.py
real 0m15.796s
user 0m15.803s
sys 0m0.008s
So to avoid this problem you can use one of the below ways:
Atomic grouping (Currently doesn't support in Python, A RFE was created to add it to Python 3)
Reduction the possibility of backtracking by breaking the nested groups to separate regexes.
To avoid the catastrophic backtracking I suggest
r'\d+(,\d+)*/$'
I have the following regex to detect start and end script tags in the html file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>
it works but needs really long time to detect <script>,
even minutes or hours for long strings
The lite version works perfectly even for long string:
<script[^<]*>[^<]*</script>
however, the extended pattern I use as well for other tags like <a> where < and > are possible to appears also as values of attributes.
python test:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:^s]))*)</script>', re.I + re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
how can I fix it?
The inner part of regex (after <script>) should be changed and simplified.
PS :) Anticipate your answers about the wrong approach like using regex in html parsing,
I know very well many html/xml parsers, and what I can expect in often broken html code, and regex is really useful here.
comment:
well, I need to handle:
each <a < document like this.border="5px;">
and approach is to use parsers and regex together
BeautifulSoup is only 2k lines, which not handling every html and just extends regex from sgmllib.
and the main reason is that I must know exact the position where every tag starts and stop. and every broken html must be handled.
BS is not perfect, sometimes happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
#Cylian:
atomic grouping as you know is not available in python's re.
so non-geedy everything .*? until <\s/\stag\s*>** is a winner at this time.
I know that is not perfect in that case:
re.search('<\sscript.?<\s*/\sscript\s>','< script </script> shit </script>').group()
but I can handle refused tail in the next parsing.
It's pretty obvious that html parsing with regex is not one battle figthing.
Use an HTML parser like beautifulsoup.
See the great answers for "Can I remove script tags with beautifulsoup?".
If your only tool is a hammer, every problem starts looking like a nail. Regular expressions are a powerful hammer but not always the best solution for some problems.
I guess you want to remove scripts from HTML posted by users for security reasons. If security is the main concern, regular expressions are hard to implement because there are so many things a hacker can modify to fool your regex, yet most browsers will happily evaluate... An specialized parser is easier to use, performs better and is safer.
If you are still thinking "why can't I use regex", read this answer pointed by mayhewr's comment. I could not put it better, the guy nailed it, and his 4433 upvotes are well deserved.
I don't know python, but I know regular expressions:
if you use the greedy/non-greedy operators you get a much simpler regex:
<script.*?>.*?</script>
This is assuming there are no nested scripts.
The problem in pattern is that it is backtracking. Using atomic groups this issue could be solved. Change your pattern to this**
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
^^^^^ ^^^^^
Explanation
<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
Match any character that is NOT a “<” «[^<]+?»
Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
Match any character that is NOT a “<” «[^<]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
Match the character “<” literally «<»
Match the regular expression below «(?:[^/]|/(?:[^s]))*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
Match any character that is NOT a “/” «[^/]»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
Match the character “/” literally «/»
Match the regular expression below «(?:[^s])»
Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->
I have a big text file and I am processing it line by line using the for ... in statement:
f = open(sys.argv[1])
for line in f:
And I pass these lines through some regexes. But my code stops when this long line is being passed through a regex:
This is the line:
Mar 25 09:42:22 2011 amavis[30883]: (30883-10) Passed CLEAN, [95.0.85.202] [95.0.85.202] <oyalcin#aaa.com> -> <acanli#aaa.com.tr>,<aeren#aaa.com.tr>,<aergul#aaa.com.tr>,<dalp#aaa.com.tr>,<faks#aaa.com.tr>,<fkonyali#aaa.com.tr>,<hozsoy#aaa.com.tr>,<makcan#aaa.com.tr>,<mengin#aaa.com.tr>,<mervekayabasi#aaa.com.tr>,<muhasebe#aaa.com.tr>,<okkesgol#aaa.com.tr>,<personel#aaa.com.tr>,<skazanci#aaa.com.tr>,<sumur#aaa.com.tr>,<tkececioglu#aaa.com.tr>,<ydemirci#aaa.com.tr>,<abalcin#aaa.com>,<adanisti#aaa.com>,<akaramanli#aaa.com>,<aozsahin#aaa.com>,<benalin#aaa.com>,<cgokburun#aaa.com>,<dkilinc#aaa.com>,<gleblebici#aaa.com>,<hsannan#aaa.com>,<iziyan#aaa.com>,<kcspetrol#aaa.com>,<malakus#aaa.com>,<maltuntas#aaa.com>,<mdelice#aaa.com>,<mguclu#aaa.com>,<mkocyigit#aaa.com>,<mokuducu#aaa.com>,<mtabar#aaa.com>,<m...
And this is the regex and the place where the code stops:
pattern_clean = re.compile("(\S{3} \d{2} \d{2}\:\d{2}\:\d{2} \d{4}).*CLEAN, (LOCAL )?(\[[.\d]+\] )?(\[[.\d]+\] )?<(\S*#(\S*))?> -> (<\S*>,)* Message-ID: <(\S*)>, mail_id: (\S*), Hits: (\S*), queued_as: (\S*), (\S*)")
if pattern_clean.search(line) != None:
I have tries this script on a different file it worked okay. It also worked okay with this file too, until this line came. What may be causing this problem?
It is possible to write regular expressions that take a very, very long time to either match or fail. you've written just such a regular expression. Basically any time you see * or + nested inside another * or + be very afraid.
I think your problem may be:
(<\S*>,)*
On it's own, <\S*> will match everything up to the next whitespace, then when the full pattern fails to match it will try to shorten the match down, then the outer * means it will try lots of different combinations matching 20 emails followed by none, or 19 followed by 1, or 18 followed by 2, or 18 followed by 1 followed by 1. You've got runaway combinations there.
I suggest you try replacing all your \S occurences with a pattern that cannot match the terminating character. e.g. <[^> ]*> or [^, ]*, that may reduce the problem.
I'm no expert in Python regex, but some general observations
You have plenty of anchors, so greedy quantifiers are ok almost everywhere except in one place.
(<\S*>,)* This will not do what you think it will.
In essence, the group will only be executed 0 or 1 time, so it actually does (<\S*>,)?.
And, the match will progress like this:
<aergul#aaa.com.tr>, ... <dalp#aaa.com.tr>,
^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^
< \S* >,
Its not going to capture using an arbitrary number of capture groups.
It is functionally equivalent to (and better written as) this:
((?:<\S*?>,)+)?
I've tested (<\S*>,)* with the data sample you have provided. It works ok in Perl. But I can only assume your data extends way beyond what you are showing. Just realize your capturing everything up to the next big anchor >, Message-ID: in one expression.
It could be re-written as:
(<\S*?>,)* which just retains the last <>, in the series where parenthesis is only relavent as a non-capture grouping.
I have a regular expression to find :ABC:`hello` pattern. This is the code.
format =r".*\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
m = patt.match(l.rstrip())
if m:
...
It works well when the pattern happens once in a line, but with an example ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`". It finds only the last one.
How can I find all the three patterns?
EDIT
Based on Paul Z's answer, I could get it working with this code
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
tag, value = m.groups()
print tag, ":::", value
Result
tagbox ::: Verilog
tagbox ::: Multiply
tagbox ::: VHDL
Yeah, dcrosta suggested looking at the re module docs, which is probably a good idea, but I'm betting you actually wanted the finditer function. Try this:
format = r"\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
tag, value = m.groups()
....
Your current solution always finds the last one because the initial .* eats as much as it can while still leaving a valid match (the last one). Incidentally this is also probably making your program incredibly slower than it needs to be, because .* first tries to eat the entire string, then backs up character by character as the remaining expression tells it "that was too much, go back". Using finditer should be much more performant.
A good place to start is there module docs. In addition to re.match (which searches starting explicitly at the beginning of the string), there is re.findall (finds all non-overlapping occurrences of the pattern), and the methods match and search of compiled RegexObjects, both of which accept start and end positions to limit the portion of the string being considered. See also split, which returns a list of substrings, split by the pattern. Depending on how you want your output, one of these may help.
re.findall or even better regex.findall can do that for you in a single line:
import regex as re #or just import re
s = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"
format = r"\:([^:]*)\:\`([^`]*)\`"
re.findall(format,s)
result is:
[('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]
I have a somewhat complex regular expression which I'm trying to match against a long string (65,535 characters). I'm looking for multiple occurrences of the re in the string, and so am using finditer. It works, but for some reason it hangs after identifying the first few occurrences. Does anyone know why this might be? Here's the code snippet:
pattern = "(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*b|d*)*c)"
matches = re.finditer(pattern, string)
for match in matches:
print "(%d-%d): %s" % (match.start(), match.end(), match.group())
It prints out the first four occurrences, but then it hangs. When I kill it using Ctrl-C, it tells me it was killed in the iterator:
Traceback (most recent call last):
File "code.py", line 133, in <module>
main(sys.argv[1:])
File "code.py", line 106, in main
for match in matches:
KeyboardInterrupt
If I try it with a simpler re, it works fine.
I'm running this on python 2.5.4 running on Cygwin on Windows XP.
I managed to get it to hang with a very much shorter string. With this 50 character string, it never returned after about 5 minutes:
ddddddeddbedddbddddddddddddddddddddddddddddddddddd
With this 39 character string it took about 15 seconds to return (and display no matches):
ddddddeddbedddbdddddddddddddddddddddddd
And with this string it returns instantly:
ddddddeddbedddbdddddddddddddd
Definitely exponential behaviour. You've got so many d* parts to your regexp that it'll be backtracking like crazy when it gets to the long string of d's, but fails to match something earlier. You need to rethink the regexp, so it has less possible paths to try.
In particular I think:
([ef]d\*b|d\*)*</pre></code> and <code><pre>([ef]|([gh]d\*(ad\*[gh]d)\*b))d\*b
Might need rethinking, as they'll force a retry of the alternate match. Plus they also overlap in terms of what they match. They'd both match edb for example, but if one fails and tries to backtrack the other part will probably have the same behaviour.
So in short try not to use the | if you can and try to make sure the patterns don't overlap where possible.
Could it be that your expression triggers exponential behavior in the Python RE engine?
This article deals with the problem. If you have the time, you might want to try running your expression in an RE engine developed using those ideas.
Thanks to all the responses, which were very helpful. In the end, surprisingly, it was easy to speed it up. Here's the original regex:
(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*b|d*)*c)
I noticed that the |d* near the end was not really what I needed, so I modified it as follows:
(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*bd*)*c)
Now it works almost instantaneously on the 65,536 character string. I guess now I just have to make sure that the regex is really matching the strings I need it to match...
I think you experience what is known as "catastrophic backtracking".
Your regex has many optional/alternative parts, all of which still try to match, so previous sub-expressions give back characters to the following expression on local failure. This leads to a back-and-fourth behavior within the regex and exponentially rising execution times.
Python (2.7+?, I'm not sure) supports atomic grouping and possessive quantifiers, you could examine your regex to identify the parts that should match or fail as a whole. Unnecessary backtracking can be brought under control with that.
catastrophic backtracking!
Regular Expressions can be very expensive. Certain (unintended and intended) strings may cause RegExes to exhibit exponential behavior. We've taken several hotfixes for this. RegExes are so handy, but devs really need to understand how they work; we've gotten bitten by them.
example and debugger:
http://www.codinghorror.com/blog/archives/000488.html
You already gave yourself the answer: The regular expression is to complex and ambiguous.
You should try to find a less complex and more distinct expression that is easier to process. Or tell us what you want to accomplish and we could try to help you to find one.
Edit If you just want to allow ds in every position as you said in a comment to John Montgomery’s answer, you should remove them before testing the pattern:
import re
string = "ddddddeddbedddbddddddddddddddddddddddddddddddddddd"
pattern = "(([ef]|([gh](a[gh])*b))b([ef]b)*c)"
matches = re.finditer(pattern, re.sub("d+", "", string))
for match in matches:
print "(%d-%d): %s" % (match.start(), match.end(), match.group())