I have a big text file and I am processing it line by line using the for ... in statement:
f = open(sys.argv[1])
for line in f:
And I pass these lines through some regexes. But my code stops when this long line is being passed through a regex:
This is the line:
Mar 25 09:42:22 2011 amavis[30883]: (30883-10) Passed CLEAN, [95.0.85.202] [95.0.85.202] <oyalcin#aaa.com> -> <acanli#aaa.com.tr>,<aeren#aaa.com.tr>,<aergul#aaa.com.tr>,<dalp#aaa.com.tr>,<faks#aaa.com.tr>,<fkonyali#aaa.com.tr>,<hozsoy#aaa.com.tr>,<makcan#aaa.com.tr>,<mengin#aaa.com.tr>,<mervekayabasi#aaa.com.tr>,<muhasebe#aaa.com.tr>,<okkesgol#aaa.com.tr>,<personel#aaa.com.tr>,<skazanci#aaa.com.tr>,<sumur#aaa.com.tr>,<tkececioglu#aaa.com.tr>,<ydemirci#aaa.com.tr>,<abalcin#aaa.com>,<adanisti#aaa.com>,<akaramanli#aaa.com>,<aozsahin#aaa.com>,<benalin#aaa.com>,<cgokburun#aaa.com>,<dkilinc#aaa.com>,<gleblebici#aaa.com>,<hsannan#aaa.com>,<iziyan#aaa.com>,<kcspetrol#aaa.com>,<malakus#aaa.com>,<maltuntas#aaa.com>,<mdelice#aaa.com>,<mguclu#aaa.com>,<mkocyigit#aaa.com>,<mokuducu#aaa.com>,<mtabar#aaa.com>,<m...
And this is the regex and the place where the code stops:
pattern_clean = re.compile("(\S{3} \d{2} \d{2}\:\d{2}\:\d{2} \d{4}).*CLEAN, (LOCAL )?(\[[.\d]+\] )?(\[[.\d]+\] )?<(\S*#(\S*))?> -> (<\S*>,)* Message-ID: <(\S*)>, mail_id: (\S*), Hits: (\S*), queued_as: (\S*), (\S*)")
if pattern_clean.search(line) != None:
I have tried this script on a different file and it worked okay. It also worked with this file, until this line came up. What may be causing this problem?
It is possible to write regular expressions that take a very, very long time to either match or fail, and you've written just such a regular expression. Basically, any time you see * or + nested inside another * or +, be very afraid.
I think your problem may be:
(<\S*>,)*
On its own, <\S*> will match everything up to the next whitespace. When the full pattern then fails to match, the engine tries to shorten that match, and the outer * means it will try lots of different combinations: matching 20 emails followed by none, or 19 followed by 1, or 18 followed by 2, or 18 followed by 1 followed by 1. You've got runaway combinations there.
I suggest you try replacing all your \S occurrences with a pattern that cannot match the terminating character, e.g. <[^> ]*> or [^, ]*; that should reduce the problem.
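For a rough sense of the difference, here is a small sketch using a shortened, made-up recipient list and only the middle fragment of the pattern (so it is illustrative rather than the full fix):

import re

# Hypothetical shortened recipient list in the style of the log line above;
# " Message-ID:" is deliberately absent, so both patterns must fail to match.
line = "<oyalcin#aaa.com> -> " + "".join("<u%d#aaa.com.tr>," % i for i in range(30))

# Fragment of the original pattern: \S can run past '>' and ',', so the
# nested quantifiers backtrack explosively once the overall match fails.
slow = re.compile(r"<(\S*#(\S*))?> -> (<\S*>,)* Message-ID:")

# Suggested fix: character classes that cannot cross the terminating '>'.
fast = re.compile(r"<([^>#]*#[^> ]*)?> -> (<[^>]*>,)* Message-ID:")

print(fast.search(line))   # None, returned almost instantly
# slow.search(line)        # may run for a very long time on a realistic line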
I'm no expert in Python regex, but here are some general observations.
You have plenty of anchors, so greedy quantifiers are ok almost everywhere except in one place.
(<\S*>,)* This will not do what you think it will.
In essence, only the group's last repetition is captured, so as far as the captured text is concerned it behaves like (<\S*>,)?.
And, the match will progress like this:
<aergul#aaa.com.tr>, ... <dalp#aaa.com.tr>,
^                                        ^^
<                  \S*                   >,
It's not going to capture into an arbitrary number of capture groups.
It is functionally equivalent to (and better written as) this:
((?:<\S*?>,)+)?
I've tested (<\S*>,)* with the data sample you have provided, and it works OK in Perl. But I can only assume your data extends well beyond what you are showing. Just realize you're capturing everything up to the next big anchor, >, Message-ID:, in one expression.
It could be re-written as:
(<\S*?>,)*, which just retains the last <...>, in the series; the parentheses are only relevant as grouping and are better written as a non-capturing group.
What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:' can vary and doesn't matter for the comparison, as long as the first and last parts of the string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the ':' as mentioned above?
I apologize if this has an obvious answer; I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your problem stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more and the second matching 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside: it's likely that you are mixing up re.match and re.search. match will only return a match that begins at the start of the string, while search will return matches starting anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to find a match deeper in the string.
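For reference, a two-line illustration of the match/search difference:

import re

print(re.match(r'foo', 'some foo'))    # None: match only succeeds at the very start
print(re.search(r'foo', 'some foo'))   # a match object for 'foo' at positions 5-8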
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
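A quick way to check this behaviour in Python:

import re

pat = re.compile(r'(?<=test)foo')
print(pat.search('foo'))        # None
print(pat.search('test-foo'))   # None
print(pat.search('test foo'))   # None
print(pat.search('testfoo'))    # a match object covering only 'foo'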
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending them with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters, which can hide problems, as entire * expressions can be skipped. Your regex likely matched several lines of text as one match, hiding multiple records inside a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more times. This is often more correct, as it prevents entire sub-expressions from silently matching nothing.
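Putting it together, a small sketch (using the assumed sample lines above) of how the captured group isolates the value after the colon for comparison:

import re

lines = [
    r"V-1\ZDS\R\EMBO-20-1:24",
    r"V-1\ZDS\R\EMBO-20-6:24",
    r"V-1\ZDS\R\EMBO-20-3:93",
]

pattern = re.compile(r"^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$")

values = []
for line in lines:
    m = pattern.match(line)
    if m:
        values.append(m.group(1))   # only the number after the colon

print(values)   # ['24', '24', '93']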
This code snippet is meant to search for a regex match on each line of the supplied file. re.search() gets hung on a line that contains the "#" character 3e+5 times.
What could be the solution to this problem?
import re
print "Started..."
exp = "(.*)\$\$\$Uniqueterm:(.*)"
with open("sample.txt", 'r') as file:
    for line in file:
        if re.search(exp, line):
            print "Found match: " + re.search(exp, line).groups()[1].strip()
print "File finished..."
Sample input file (sample.txt):
abc
pqr
##### (3e+5 times '#' in a single line)
xyz
$$$Uniqueterm: Match it
qaz
Expected Output:
Match it
You're using re.search with a regex that starts with (.*). re.search looks for a match at any starting position, meaning it has to start a search from every possible starting index until it finds a match or runs out of positions to search from. The leading (.*) forces a scan of the entire string starting from the search start position, for every starting position.
It's classic catastrophic backtracking, just with part of the backtracking implicit in the use of re.search instead of built into the regex itself. You could adjust the regex to eliminate the catastrophic backtracking, but why use a regex at all? Basic methods like str.split or str.find can do the job just fine. Jean-François Fabre's answer shows one way to do it.
A regex engine can reach high complexities, especially when it must backtrack, as in your case.
So when the expression to search for is long and you have groups that must be computed with a lot of trial and error (i.e. backtracking), the search can take ages (see a famous example of a regex failure on the Stack Overflow network).
On July 20, 2016 we experienced a 34 minute outage starting at 14:44 UTC. It took 10 minutes to identify the cause, 14 minutes to write the code to fix it, and 10 minutes to roll out the fix to a point where Stack Overflow became available again.
The direct cause was a malformed post that caused one of our regular expressions to consume high CPU on our web servers. The post was in the homepage list, and that caused the expensive regular expression to be called on each home page view.
...
This regular expression has been replaced with a substring function.
I'd propose a workaround, since you don't really need a regex here: str.split will do and is fast, since it just searches for the substring (an O(N) approach) and then creates two strings, which is equivalent to what you're trying to do with the regular expression:
a = "foo$$$Uniqueterm:bar"
g1,g2 = a.split("$$$Uniqueterm:")
print(g1,g2)
result
foo bar
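Applied to the original loop, the same idea might look like this (a sketch using Python 3 print syntax and the sample.txt name from the question):

MARKER = "$$$Uniqueterm:"

with open("sample.txt") as f:
    for line in f:
        # plain substring test: one linear scan per line, no backtracking possible
        if MARKER in line:
            print("Found match: " + line.split(MARKER, 1)[1].strip())
print("File finished...")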
First time posting; I've lurked for a little while, really excited about the helpful community here.
So, working with "Automate the boring stuff" by Al Sweigart
Doing an exercise that requires I build a regex that finds numbers in standard number format: three digits, comma, three digits, comma, etc...
So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.
I have the following.
import re
testStr = '1,234,343'
matches = []
numComma = re.compile(r'^(\d{1,3})*(,\d{3})*$')
for group in numComma.findall(str(testStr)):
    Num = group
    print(str(Num) + '-')  # Printing here to test each loop
    matches.append(str(Num[0]))
#if len(matches) > 0:
# print(''.join(matches))
Which outputs this....
('1', ',343')-
I'm not sure why the middle ",234" is being skipped over. Something wrong with the regex, I'm sure. Just can't seem to wrap my head around this one.
Any help or explanation would be appreciated.
FOLLOW UP EDIT. So after following all your advice that I could assimilate, I got it to work perfectly for several inputs.
import re
testStr = '1,234,343'
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Num = numComma.findall(testStr)
print(Num)
gives me....
['1,234,343']
Great! BUT! What about when I change the string input to something like
'1,234,343 and 12,345'
Same code returns....
[]
Grrr... lol, this is fun, I must admit.
So the purpose of the exercise is to be able to eventually scan a block of text and pick out all the numbers in this format. Any insight? I thought this would add an additional tuple, not return an empty one...
FOLLOW UP EDIT:
So, a day later (been busy with 3 daughters and Honey-do lists), I've finally been able to sit down and examine all the help I've received. Here's what I've come up with, and it appears to work flawlessly. Included comments for my own personal understanding. Thanks again for everything, Blckknght, Saleem, mhawke, and BHustus.
My final code:
import re
testStr = '12,454 So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.'
numComma = re.compile(r'''
(?:(?<=^)|(?<=\s)) # Looks behind the Match for start of line and whitespace
((?:\d{1,3}) # Matches on groups of 1-3 numbers.
(?:,\d{3})*) # Matches on groups of 3 numbers preceded by a comma
(?=\s|$)''', re.VERBOSE) # Looks ahead of match for end of line and whitespace
Num = numComma.findall(testStr)
print(Num)
Which returns:
['12,454', '1,234', '23,322', '1,234,567', '12']
Thanks again! I have had such a positive first posting experience here, amazing. =)
The issue is due to the fact you're using a repeated capturing group, (,\d{3})* in your pattern. Python's regex engine will match that against both the thousands and ones groups of your number, but only the last repetition will be captured.
I suspect you want to use non-capturing groups instead. Add ?: to the start of each set of parentheses (I'd also recommend, on general principle, to use a raw string, though you don't have escaping issues in your current pattern):
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Since there are no groups being captured, re.findall will return the whole matched text, which I think is what you want. You can also use re.match or re.search and call the group() method on the returned match object to get the whole matched text.
The problem is:
A regex match will return a tuple item for each group. However, it is important to distinguish a group from a capture. Since you only have two parenthesis-delimited groups, the matches will always be tuples of two: the first group and the second. But the second group matches twice (the short demo after the list below shows the effect).
1: first group, captured
,234: second group, captured
,343: also second group, which means it overwrites ,234.
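A tiny demonstration of that overwriting, using the pattern from the question:

import re

m = re.match(r'^(\d{1,3})*(,\d{3})*$', '1,234,343')
print(m.groups())   # ('1', ',343') -- the second group matched ',234' and then
                    # ',343', but only its last capture is kept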
Unfortunately, it seems that vanilla Python does not have a way to access any captures of a group other than the last one, the way .NET's regex implementation does. However, if you are only interested in validating the specific number, your best bet is to run re.search against it: if it returns a non-None value, the input string is a valid number; otherwise, it is not.
Additionally: A test on your regex. Note that, as Paul Hankin stated, test cases 6 and 7 match even though they shouldn't, due to the first * following the first capturing group, which will make the initial group match any number of times. Otherwise, your regex is correct. Fixed version.
RESPONSE TO EDIT:
The reason your regex now returns an empty list on '1,234,343 and 12,345' is the ^ and $ anchors in your regex. The ^ anchor, at the start of the regex, says 'this point needs to be at the start of the string'. The $ is its counterpart, saying 'this needs to be at the end of the string'. This is good if you want your entire string, start to end, to match the pattern, but if you want to pick out multiple numbers, you should do away with them.
HOWEVER!
If you leave the regex in its current form sans anchors, it will now match the individual elements of 1,23,45 as separate numbers. So we need to add a zero-width positive lookahead assertion that says 'make sure that after this number there is either whitespace or the end of the line'. You can see the change here. The tail end, (?=\s|$), is our lookahead assertion: it doesn't capture anything, it just makes sure the criteria are met, in this case whitespace (\s) or (|) the end of the line ($).
BUT: In a similar vein, the previous regex would have matched from the 2 onward in "1234,567", giving us the number "234,567", which would be bad. So we use a lookbehind assertion similar to our lookahead at the end: (?:(?<=^)|(?<=\s)), which only matches if the number is at the beginning of the string or has whitespace before it. This version can be found here, and should soundly satisfy any non-decimal number related needs.
Try:
import re
p = re.compile(ur'(?:(?<=^)|(?<=\s))((?:\d{1,3})(?:,\d{3})*)(?=\s|$)', re.DOTALL)
test_str = """1,234 and 23,322 and 1,234,567 1,234,567,891 200 and 12 but
not 1,23,1 or ,,1111, or anything else silly"""
for m in re.findall(p, test_str):
    print m
and its output will be
1,234
23,322
1,234,567
1,234,567,891
200
12
You can see a demo here
This regex would match any valid number and would never match an invalid number:
(?<=^|\s)(?:(?:0|[1-9][0-9]{0,2}(?:,[0-9]{3})*))(?=\s|$)
https://regex101.com/r/dA4yB1/1
I have the following regex to detect start and end script tags in the html file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>
It works, but it needs a really long time to detect <script>, even minutes or hours for long strings.
The lite version works perfectly even for long strings:
<script[^<]*>[^<]*</script>
However, I also use the extended pattern for other tags like <a>, where < and > may also appear as values of attributes.
python test:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I + re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
how can I fix it?
The inner part of the regex (after <script>) should be changed and simplified.
PS :) I anticipate answers about this being the wrong approach, i.e. using regex for HTML parsing.
I know many HTML/XML parsers very well, and what to expect in often-broken HTML code, and regex is really useful here.
comment:
well, I need to handle:
each <a < document like this.border="5px;">
and my approach is to use parsers and regex together.
BeautifulSoup is only ~2k lines; it doesn't handle every HTML and mostly just extends the regexes from sgmllib.
And the main reason is that I must know the exact position where every tag starts and stops, and every broken HTML must be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
#Cylian:
atomic grouping, as you know, is not available in Python's re.
so non-greedy everything, .*? until <\s*/\s*tag\s*>, is the winner at this time.
I know that it is not perfect in this case:
re.search('<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group()
but I can handle the refused tail in the next parsing.
It's pretty obvious that HTML parsing with regex is not a one-battle fight.
Use an HTML parser like beautifulsoup.
See the great answers for "Can I remove script tags with beautifulsoup?".
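As a rough sketch of that approach with BeautifulSoup 4 (the html.parser backend and the sample string from the question are only illustrative):

from bs4 import BeautifulSoup   # third-party: pip install beautifulsoup4

html = '11<script type="text/javascript"> easy>example</script>22'
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all('script'):
    tag.decompose()     # remove each <script> element from the tree

print(soup)             # 1122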
If your only tool is a hammer, every problem starts looking like a nail. Regular expressions are a powerful hammer but not always the best solution for some problems.
I guess you want to remove scripts from HTML posted by users for security reasons. If security is the main concern, regular expressions are hard to get right because there are so many things a hacker can modify to fool your regex, yet most browsers will happily evaluate... A specialized parser is easier to use, performs better and is safer.
If you are still thinking "why can't I use regex", read the answer pointed to in mayhewr's comment. I could not put it better, the guy nailed it, and his 4433 upvotes are well deserved.
I don't know python, but I know regular expressions:
if you use the greedy/non-greedy operators you get a much simpler regex:
<script.*?>.*?</script>
This is assuming there are no nested scripts.
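Checked against the two test strings from the question, the non-greedy pattern behaves like this (a sketch only, not a robust HTML solution):

import re

script_re = re.compile(r'<script.*?>.*?</script>', re.I | re.DOTALL)

print(script_re.search(
    '11<script type="text/javascript"> easy>example</script>22').group())
# <script type="text/javascript"> easy>example</script>

long_html = '<script type="text/javascript">' + ('hard example' * 50) + '</script>'
print(script_re.search(long_html) is not None)   # True, and it returns immediately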
The problem with your pattern is that it backtracks. Using atomic groups, this issue could be solved. Change your pattern to this:
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
       ^^^                             ^^^
Explanation
<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

Match the characters “<script” literally «<script»
Python does not support atomic grouping «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
      Match any character that is NOT a “<” «[^<]+?»
         Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Python does not support atomic grouping «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
      Match any character that is NOT a “<” «[^<]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->
I have a somewhat complex regular expression which I'm trying to match against a long string (65,535 characters). I'm looking for multiple occurrences of the re in the string, and so am using finditer. It works, but for some reason it hangs after identifying the first few occurrences. Does anyone know why this might be? Here's the code snippet:
pattern = "(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*b|d*)*c)"
matches = re.finditer(pattern, string)
for match in matches:
print "(%d-%d): %s" % (match.start(), match.end(), match.group())
It prints out the first four occurrences, but then it hangs. When I kill it using Ctrl-C, it tells me it was killed in the iterator:
Traceback (most recent call last):
File "code.py", line 133, in <module>
main(sys.argv[1:])
File "code.py", line 106, in main
for match in matches:
KeyboardInterrupt
If I try it with a simpler re, it works fine.
I'm running this on python 2.5.4 running on Cygwin on Windows XP.
I managed to get it to hang with a much shorter string. With this 50 character string, it never returned after about 5 minutes:
ddddddeddbedddbddddddddddddddddddddddddddddddddddd
With this 39 character string it took about 15 seconds to return (and display no matches):
ddddddeddbedddbdddddddddddddddddddddddd
And with this string it returns instantly:
ddddddeddbedddbdddddddddddddd
Definitely exponential behaviour. You've got so many d* parts in your regexp that it'll backtrack like crazy when it gets to the long string of d's but then fails to match what comes next. You need to rethink the regexp so it has fewer possible paths to try.
In particular, I think ([ef]d*b|d*)* and ([ef]|([gh]d*(ad*[gh]d)*b))d*b might need rethinking, as they'll force a retry of the alternate match. They also overlap in terms of what they match: they'd both match edb, for example, but if one fails and tries to backtrack, the other part will probably have the same behaviour.
So, in short, try not to use | if you can, and try to make sure the patterns don't overlap where possible.
Could it be that your expression triggers exponential behavior in the Python RE engine?
This article deals with the problem. If you have the time, you might want to try running your expression in an RE engine developed using those ideas.
Thanks to all the responses, which were very helpful. In the end, surprisingly, it was easy to speed it up. Here's the original regex:
(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*b|d*)*c)
I noticed that the |d* near the end was not really what I needed, so I modified it as follows:
(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*bd*)*c)
Now it works almost instantaneously on the 65,536 character string. I guess now I just have to make sure that the regex is really matching the strings I need it to match...
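For reference, a quick way to sanity-check the modified pattern against the 50-character string quoted above (Python 3 syntax; the timing is only indicative):

import re
import time

# the 50-character string that made the original pattern hang
s = "ddddddeddbedddbddddddddddddddddddddddddddddddddddd"

fixed = re.compile(r"(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*bd*)*c)")

start = time.time()
print(list(fixed.finditer(s)))                  # [] -- no match, decided quickly
print("%.4f seconds" % (time.time() - start))

# The original pattern, with ([ef]d*b|d*)*, backtracks exponentially on the same
# input, so it is left commented out here:
# slow = re.compile(r"(([ef]|([gh]d*(ad*[gh]d)*b))d*b([ef]d*b|d*)*c)")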
I think you are experiencing what is known as "catastrophic backtracking".
Your regex has many optional/alternative parts, all of which still try to match, so earlier sub-expressions give back characters to the following expression on local failure. This leads to back-and-forth behaviour within the regex and exponentially rising execution times.
Python's built-in re module does not support atomic grouping or possessive quantifiers (they were only added in Python 3.11; the third-party regex module also provides them), but the idea is the same: examine your regex to identify the parts that should match or fail as a whole, so unnecessary backtracking can be brought under control.
catastrophic backtracking!
Regular Expressions can be very expensive. Certain (unintended and intended) strings may cause RegExes to exhibit exponential behavior. We've taken several hotfixes for this. RegExes are so handy, but devs really need to understand how they work; we've gotten bitten by them.
example and debugger:
http://www.codinghorror.com/blog/archives/000488.html
You already gave yourself the answer: the regular expression is too complex and ambiguous.
You should try to find a less complex and more distinct expression that is easier to process. Or tell us what you want to accomplish and we could try to help you find one.
Edit: If you just want to allow d's in every position, as you said in a comment to John Montgomery's answer, you should remove them before testing the pattern:
import re
string = "ddddddeddbedddbddddddddddddddddddddddddddddddddddd"
pattern = "(([ef]|([gh](a[gh])*b))b([ef]b)*c)"
matches = re.finditer(pattern, re.sub("d+", "", string))
for match in matches:
print "(%d-%d): %s" % (match.start(), match.end(), match.group())