I am using a regex like "a.{1000000}b.{1000000}c" to pattern match on a string. However, this is WAY too slow. Is there a better way to do this? I am not interested in the content between a, b, and c; as long as the gaps are of my specified size, I don't care what's in them. One can think of it as skipping n characters. Checking by index doesn't serve me well either; I need to be using some built-in method written in C. Any suggestions?
Thanks in advance
If you just need to verify that a string matches a given pattern and do not care to extract a, b, or c, then this would work:
(?=^a.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}b.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}.{50000}c$)
Many regex engines (PCRE, for example) cap quantifier values at 65535, so if you need one million you have to repeat .{50000} 20 times like I did above.
Now you just need to write Python code that says "if regex match then proceed" (sketched below).
Regex101 takes 68ms so I would consider that to be "fast".
https://regex101.com/r/q6RgNJ/1
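For that "if regex match then proceed" step, here is a minimal Python sketch that builds the repeated .{50000} chunks programmatically instead of pasting them by hand; the DOTALL flag and the helper name are assumptions, and DOTALL is only needed if the gaps may contain newlines.

import re

gap, chunk = 1_000_000, 50_000                 # chunk stays under the 65535 cap
skip = f".{{{chunk}}}" * (gap // chunk)        # ".{50000}" repeated 20 times
pattern = re.compile(f"(?=^a{skip}b{skip}c$)", re.DOTALL)

def gaps_match(big_string):
    # True if the string is exactly: a, one million chars, b, one million chars, c
    return pattern.match(big_string) is not None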
I have a massive string. It looks something like this:
hej34g934gj93gh398gie foo#bar.com e34y9u394y3h4jhhrjg bar#foo.com hge98gej9rg938h9g34gug
Except that it's much longer (1,000,000+ characters).
My goal is to find all the email addresses in this string.
I've tried a number of solutions, including this one:
#matches foo#bar.com and bar#foo.com
re.findall(r'[\w\.-]{1,100}#[\w\.-]{1,100}', line)
Although the above code technically works, it takes an insane amount of time to execute. I'm not sure if it counts as catastrophic backtracking or if it's just really inefficient, but whatever the case, it's not good enough for my use case.
I suspect that there's a better way to do this. For example, if I use this regex to only search for the latter part of the email addresses:
#matches #bar.com and #foo.com
re.findall(r'#[\w-]{1,256}[\.]{1}[a-z.]{1,64}', line)
It executes in just a few milliseconds.
I'm not familiar enough with regex to write the rest, but I assume that there's some way to find the #x.x part first and then check the first part afterwards? If so, then I'm guessing that would be a lot quicker.
You can use the PyPI regex module by Matthew Barnett, which is much more powerful and stable when it comes to parsing long texts. This regex library has some basic checks for pathological cases implemented. The library author mentions in his post:
The internal engine no longer interprets a form of bytecode but
instead follows a linked set of nodes, and it can work breadth-wise as
well as depth-first, which makes it perform much better when faced
with one of those 'pathological' regexes.
However, there is yet another trick you may implement in your regex: Python re (and regex, too) optimize matching at word boundary locations. Thus, if your pattern is supposed to match at a word boundary, always start your pattern with it. In your case, r'\b[\w.-]{1,100}#[\w.-]{1,100}' or r'\b\w[\w.-]{0,99}#[\w.-]{1,100}' should also work much better than the original pattern without a word boundary.
Python test:
import re, regex, timeit

text = 'your_long_string'
re_pattern = re.compile(r'\b\w[\w.-]{0,99}#[\w.-]{1,100}')
regex_pattern = regex.compile(r'\b\w[\w.-]{0,99}#[\w.-]{1,100}')

# stdlib re
timeit.timeit("p.findall(text)", 'from __main__ import text, re_pattern as p', number=100000)
# => 6034.659449000001

# PyPI regex
timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern as p', number=100000)
# => 218.1561693
Don't use regex on the whole string. Regexes are slow, and avoiding them is your best bet for better overall performance.
My first approach would look like this:
Split the string on whitespace.
Filter the result down to the parts that contain #.
Create a pre-compiled regex.
Use the regex only on the remaining parts to remove false positives (see the sketch below).
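A minimal sketch of that approach, assuming (as in the example string) that the addresses are whitespace-delimited; the pattern is the one from the question:

import re

EMAIL = re.compile(r'[\w\.-]{1,100}#[\w\.-]{1,100}')   # pre-compiled once

def find_emails(line):
    results = []
    for part in line.split():          # 1. split on whitespace
        if '#' not in part:            # 2. cheap pre-filter
            continue
        if EMAIL.fullmatch(part):      # 3./4. run the regex on candidates only
            results.append(part)
    return results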
Another idea (sketched below):
In a loop:
use .index("#") to find the position of the next candidate
extend e.g. 100 characters to the left, 50 to the right to cover name and domain
adapt the range depending on the last email address you found so you don't overlap
check the range with a regex, if it matches, yield the match
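A rough sketch of that loop, using str.find instead of .index so a missing "#" simply ends the iteration; the 100/50 window sizes are the illustrative numbers from the bullets above, not tuned values:

import re

EMAIL = re.compile(r'[\w\.-]{1,100}#[\w\.-]{1,100}')

def iter_emails(text):
    pos = 0
    while True:
        hit = text.find('#', pos)            # position of the next candidate
        if hit == -1:
            return
        start = max(pos, hit - 100)          # up to 100 chars left for the name
        end = min(len(text), hit + 50)       # up to 50 chars right for the domain
        m = EMAIL.search(text, start, end)   # check only that window
        if m:
            yield m.group()
            pos = m.end()                    # don't overlap the last match
        else:
            pos = hit + 1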
Let's say I have a string and a list of strings:
a = 'ABCDEFG'
b = ['ABC', 'QRS', 'AHQ']
How can I pull out the string in list b that matches up perfectly with a section of the string a? The return would be something like ['ABC'].
The most important issue is that I have tens of millions of strings, so time efficiency is essential.
If you only want the first match in b:
next((s for s in b if s in a), None)
This has the advantage of short-circuiting as soon as it finds a match whereas the other list solutions will keep going. If no match is found, it will return None.
Keep in mind that Python's substring search x in a is already optimized pretty well for the general case (and coded in C, for CPython), so you're unlikely to beat it in general, especially with pure Python code.
However, if you have a more specialized case, you can do much better.
For example, if you have an arbitrary list of millions of strings b that all need to be searched for within one giant static string a that never changes, preprocessing a can make a huge difference. (Note that this is the opposite of the usual case, where preprocessing the patterns is the key.)
On the other hand, if you expect matches to be unlikely, and you've got the whole b list in advance, you can probably get some large gains by organizing b in some way. For example, there's no point searching for "ABCD" if "ABC" already failed; if you need to search both "ABC" and "ABD" you can search for "AB" first and then check whether it's followed by "C" or "D" so you don't have to repeat yourself; etc. (It might even be possible to merge all of b into a single regular expression that's close enough to optimal… although with millions of elements, that probably isn't the answer.)
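For illustration only, one minimal way to act on that idea is to group b by a short shared prefix, scan a once per prefix, and skip every member whose prefix never occurs at all (the prefix length of 2 is an arbitrary choice):

from collections import defaultdict

def matches(a, b, prefix_len=2):
    groups = defaultdict(list)
    for s in b:
        groups[s[:prefix_len]].append(s)     # bucket strings by shared prefix
    found = []
    for prefix, members in groups.items():
        if prefix in a:                      # one scan of a per distinct prefix
            found.extend(s for s in members if s in a)
    return found

# matches('ABCDEFG', ['ABC', 'QRS', 'AHQ']) -> ['ABC']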
But it's hard to guess in advance, with no more information than you've given us, exactly what algorithm you want.
Wikipedia has a pretty good high-level overview of string searching algorithms. There's also a website devoted to pattern matching in general, which seems to be a bit out of date, but then I doubt you're going to turn out to need an algorithm invented in the past 3 years anyway.
Answer:
(x for x in b if x in a)
That will return a generator yielding the elements of b that occur in a. Take the first one or loop over it.
In [3]: [s for s in b if s in a]
Out[3]: ['ABC']
On my machine this takes about 3 seconds when b contains 20,000,000 elements (tested with a and b containing strings similar to those in the question).
You might want to have a look at the Boyer–Moore string search algorithm (Wikipedia has a good article on it).
But without knowing more, this might be overkill!
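If you want to see the idea, here is a sketch of the simplified Boyer-Moore-Horspool variant (illustration only; as noted above, CPython's in operator is already a fast C implementation):

def horspool_find(haystack, needle):
    """Return the index of needle in haystack, or -1 (bad-character rule only)."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # how far to shift when a mismatch occurs, keyed by the character currently
    # aligned with the needle's last position; rightmost occurrence wins
    shift = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift.get(haystack[pos + m - 1], m)
    return -1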
For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change previously matched groups before using them, or have I just asked yet another question not meant for REs?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
Well this can be done with a regex (if I understand the question):
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
Consider this Python code:
import timeit
import re

def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello)')

def two():
    r.search(mystring)

mystring = "hello" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])

mystring = "goodbye" * 1000
print([timeit.timeit(k, number=10000) for k in (one, two)])
Basically, I'm benchmarking two ways to check existence of one of several substrings in a large string.
What I get here (Python 3.2.3) is this output:
[0.36678314208984375, 0.03450202941894531]
[0.6672089099884033, 3.7519450187683105]
In the first case, the regular expression easily defeats the any expression - the regular expression finds the substring immediately, while the any has to check the whole string a couple of times before it gets to the correct substring.
But what's going on in the second example? In the case where the substring isn't present, the regular expression is surprisingly slow! This surprises me, since theoretically the regex only has to go over the string once, while the any expression has to go over the string three times. What's wrong here? Is there a problem with my regex, or are Python regexes simply slow in this case?
Note to future readers
I think the correct answer is actually that Python's string handling algorithms are really optimized for this case, and the re module is actually a bit slower. What I've written below is true, but is probably not relevant to the simple regexps I have in the question.
Original Answer
Apparently this is not a random fluke - Python's re module really is slower. It looks like it uses a recursive backtracking approach when it fails to find a match, as opposed to building a DFA and simulating it.
It uses the backtracking approach even when there are no back references in the regular expression!
What this means is that in the worst case, Python regexes take exponential, and not linear, time!
This is a very detailed paper describing the issue:
http://swtch.com/~rsc/regexp/regexp1.html
I think the graph near the end of that page summarizes it succinctly.
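To see the exponential worst case for yourself (with a classic pathological pattern, not the simple alternation from the question), something like this is enough:

import re
import timeit

pathological = re.compile(r'(a+)+$')          # nested quantifiers backtrack badly

for n in (16, 18, 20, 22):
    subject = 'a' * n + 'b'                   # can never match, forcing full backtracking
    t = timeit.timeit(lambda: pathological.match(subject), number=1)
    print(n, round(t, 4))                     # the time grows exponentially with n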
My coworker found the re2 library (https://code.google.com/p/re2/). There is a Python wrapper. It can be a bit of work to get installed on some systems.
I was having the same issue with some complex regexes and long strings; re2 sped the processing time up significantly, from seconds to milliseconds.
The reason the regex is so slow is that it not only has to go through the whole string, it also has to do several calculations at every character.
The first one simply does this:
Does f match h? No.
Does b match h? No.
Does h match h? Yes.
Does e match e? Yes.
Does l match l? Yes.
Does l match l? Yes.
Does o match o? Yes.
Done. Match found.
The second one does this:
Does f match g? No.
Does b match g? No.
Does h match g? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match d? No.
Does b match d? No.
Does h match d? No.
Does f match b? No.
Does b match b? Yes.
Does a match y? No.
Does h match b? No.
Does f match y? No.
Does b match y? No.
Does h match y? No.
Does f match e? No.
Does b match e? No.
Does h match e? No.
... 999 more times ...
Done. No match found.
I can only speculate about the difference between any and the regex, but I'm guessing the regex is slower mostly because it runs in a highly complex engine, with all its state-machine machinery, so it just isn't as efficient as a specific implementation (the in operator).
In the first string, the regex will find a match almost instantaneously, while any has to loop through the string twice before finding anything.
In the second string, however, the any performs essentially the same steps as the regex, but in a different order. This seems to point out that the any solution is faster, probably because it is simpler.
Specific code is more efficient than generic code. Any knowledge about the problem can be put to use in optimizing the solution. Simple code is preferred over complex code. Essentially, the regex is faster when the pattern will be near the start of the string, but in is faster when the pattern is near the end of the string, or not found at all.
Disclaimer: I don't know Python. I know algorithms.
You have a regexp that is made up of three regexps. Exactly how do you think that works if the regexp doesn't do the check three times? :-) There's no magic in computing; you still have to do three checks.
But the regexp will do all three tests character by character, while the one() method will check the whole string for one match before going on to the next one.
That the regexp is much faster in the first case is because you check last for the string that will actually match. That means one() needs to look through the whole string for "foo" first, then for "bar", and then for "hello", where it matches. Move "hello" first, and one() and two() are almost the same speed, as the first check done in both cases succeeds.
Regexps are much more complex tests than "in", so I'd expect them to be slower. I suspect that this complexity increases a lot when you use "|", but I haven't read the source of the regexp library, so what do I know. :-)
I'm using regular expressions with a python framework to pad a specific number in a version number:
10.2.11
I want to transform the second element to be padded with a zero, so it looks like this:
10.02.11
My regular expression looks like this:
^(\d{2}\.)(\d{1})([\.].*)
If I just regurgitate back the matching groups, I use this string:
\1\2\3
When I use my favorite regular expression test harness (http://kodos.sourceforge.net/), I can't get it to pad the second group. I tried \1\20\3, but that interprets the second reference as 20, and not 2.
Because of the library I'm using this with, I need it to be a one liner. The library takes a regular expression string, and then a string for what should be used to replace it with.
I'm assuming I just need to escape the matching groups string, but I can't figure it out. Thanks in advance for any help.
How about a completely different approach?
nums = version_string.split('.')
print(".".join("%02d" % int(n) for n in nums))
What about removing the . from the regex?
^(\d{2})\.(\d{1})[\.](.*)
replace with:
\1.0\2.\3
Try this:
(^\d(?=\.)|(?<=\.)\d(?=\.)|(?<=\.)\d$)
And replace the match with 0\1. This will pad any single-digit version component to two digits.
Does your library support named groups? That might solve your problem.
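If it does, and the replacement syntax is Python-style, the \g<name> form keeps the back-reference separate from the literal 0 that follows it. A minimal sketch with Python's re.sub (the group names are just for illustration):

import re

version = "10.2.11"
padded = re.sub(r'^(?P<major>\d+)\.(?P<minor>\d)\.(?P<rest>.*)$',
                r'\g<major>.0\g<minor>.\g<rest>',
                version)
print(padded)  # 10.02.11

With plain numbered groups the same trick is \g<1>0\g<2>\g<3>, which avoids the \20 ambiguity from the question.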