Regarding regex (specifically Python's re module): if we ignore the way the expression is written, is the length of the text the only factor in the time required to process the document? Or do other factors (such as how the text is structured) play important roles too?
One important consideration can also be whether the text actually matches the regular expression. Take (as a contrived example) the regex (x+x+)+y from this regex tutorial.
When applied to xxxxxxxxxxy it matches, taking the regex engine 7 steps. When applied to xxxxxxxxxx, it fails (of course), but it takes the engine 2558 steps to arrive at this conclusion.
For xxxxxxxxxxxxxxy vs. xxxxxxxxxxxxxx it's already 7 vs 40958 steps, and so on exponentially...
This happens especially easily with nested repetitions or regexes where the same text can be matched by two or more different parts of the regex, forcing the engine to try all permutations before being able to declare failure. This is then called catastrophic backtracking.
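A small timing sketch can make this growth visible. It assumes only the standard re and time modules; the helper name time_failed_search and the chosen lengths are illustrative, and absolute timings will differ from machine to machine.

```python
import re
import time

# The contrived pattern from the tutorial: nested repetition over the same character.
PATTERN = re.compile(r'(x+x+)+y')

def time_failed_search(n):
    """Return the seconds spent failing to match a run of n 'x' characters."""
    text = 'x' * n                      # no trailing 'y', so the search must fail
    start = time.perf_counter()
    result = PATTERN.search(text)
    elapsed = time.perf_counter() - start
    assert result is None               # sanity check: there really is no match
    return elapsed

if __name__ == '__main__':
    # Each extra pair of characters should make the failure noticeably slower.
    for n in range(10, 21, 2):
        print(n, round(time_failed_search(n), 4))
```

Keeping the lengths small matters here: the failing case grows so quickly that much longer inputs may not finish in a reasonable time.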
Both the length of the text and its contents are important.
As an example, the regular expression a+b will fail to match quickly on a string containing one million b's but more slowly on a string containing one million a's. This is because more backtracking is required in the second case: from every starting position the engine consumes the remaining run of a's before discovering that no b follows, so the work grows roughly quadratically with the length.
import timeit
x = "re.search('a+b', s)"
print(timeit.timeit(x, "import re;s='a'*10000", number=10))
print(timeit.timeit(x, "import re;s='b'*10000", number=10))
Results:
6.85791902323
0.00795443275612
Refactoring a regex into a multi-level trie accounts for about 95% of the
800% increase in performance. The other 5% comes from factoring that not only facilitates
the trie but also enhances it, for a possible 30x performance boost.
I created two regexes (re1 and re2). If I compile and run the first one (re1), it takes about 30 seconds to find all matches,
and if I compile and run the second one (re2), it takes about 1 second to find all matches.
Can you help me find the difference, or what is causing this problem?
Thanks!
import re
data = b'000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
re1 = b'.*00.*00.*00.*00.*76.*62.*55.*75.*'
re2 = b'.*aa.*00.*00.*6f.*63.*20.*6d.*75.*'
reg = re.compile(b'^%s$' % re1, re.RegexFlag.M)
results = len(reg.findall(data))
print(results)
The problem comes from backtracking in the implementation of the CPython regexp engine (as emphasized by @Thefourthbird). More specifically, it comes from the first and the last .*, which are not needed if data does not contain newline characters. Indeed, in this case, findall will either find only one match (all of data, due to the .*) or nothing. So you do not need findall: a search is enough. Moreover, anchoring with ^ and $ is not useful either when the pattern starts and ends with .*. The following code should produce the same result but is 20 times faster on my machine (still not very fast given the input size).
import re
data = b'000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
re1 = b'00.*00.*00.*00.*76.*62.*55.*75'
re2 = b'aa.*00.*00.*6f.*63.*20.*6d.*75'
reg = re.compile(re1, re.RegexFlag.M)
results = 1 if reg.search(data) else 0
print(results)
If data contains newline characters, it is a bit more complex, as only the lines containing the pattern will be matched. Indeed, . does not match a newline character (since RegexFlag.DOTALL is not set). One solution is to split the data into lines first and then apply the regexp to each line. Another solution is to use the above code, replace the search call with finditer to track the location of the matches, and then find the beginning and the end of the line around each match.
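The line-splitting variant might look like the sketch below. LINE_PATTERN is a hypothetical, shortened stand-in for re1, and count_matching_lines is a name introduced here for illustration. One assumption to be aware of: bytes.splitlines also splits on \r, whereas . only refuses to match \n.

```python
import re

# Hypothetical shortened pattern; substitute the real (unanchored) re1 or re2.
LINE_PATTERN = re.compile(rb'00.*76.*62')

def count_matching_lines(data):
    """Count the lines of a bytes object that contain the pattern."""
    # Each line is searched independently, so no leading or trailing .* is needed.
    return sum(1 for line in data.splitlines() if LINE_PATTERN.search(line))

if __name__ == '__main__':
    sample = b'0076a62\nabc\n00x76y62z'
    print(count_matching_lines(sample))
```

Because every line can match at most once with the original ^.*...*$ form, counting the lines that contain the pattern should give the same number as len(findall(...)) did.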
If you want to know more about why backtracking causes such a slow execution you can look at this post: Why can regular expressions have an exponential running time?.
Note that re1 is a quite critical (pathological) regexp for this data. There are much faster regexp engines for it. Google's RE2 is a linear-time regexp engine that should be much faster in such critical cases (but generally not on non-critical data). There is also Intel's Hyperscan regexp engine, which is generally very fast compared to other engines (although a bit less user-friendly).
I have a massive string. It looks something like this:
hej34g934gj93gh398gie foo#bar.com e34y9u394y3h4jhhrjg bar#foo.com hge98gej9rg938h9g34gug
Except that it's much longer (1,000,000+ characters).
My goal is to find all the email addresses in this string.
I've tried a number of solutions, including this one:
#matches foo#bar.com and bar#foo.com
re.findall(r'[\w\.-]{1,100}#[\w\.-]{1,100}', line)
Although the above code technically works, it takes an insane amount of time to execute. I'm not sure if it counts as catastrophic backtracking or if it's just really inefficient, but whatever the case, it's not good enough for my use case.
I suspect that there's a better way to do this. For example, if I use this regex to only search for the latter part of the email addresses:
#matches #bar.com and #foo.com
re.findall(r'#[\w-]{1,256}[\.]{1}[a-z.]{1,64}', line)
It executes in just a few milliseconds.
I'm not familiar enough with regex to write the rest, but I assume that there's some way to find the #x.x part first and then check the first part afterwards? If so, then I'm guessing that would be a lot quicker.
You can use the PyPI regex module by Matthew Barnett, which is much more powerful and stable when it comes to parsing long texts. This regex library has some basic checks for pathological cases implemented. The library author mentions in his post:
The internal engine no longer interprets a form of bytecode but
instead follows a linked set of nodes, and it can work breadth-wise as
well as depth-first, which makes it perform much better when faced
with one of those 'pathological' regexes.
However, there is yet another trick you may implement in your regex: Python re (and regex, too) optimize matching at word boundary locations. Thus, if your pattern is supposed to match at a word boundary, always start your pattern with it. In your case, r'\b[\w.-]{1,100}#[\w.-]{1,100}' or r'\b\w[\w.-]{0,99}#[\w.-]{1,100}' should also work much better than the original pattern without a word boundary.
Python test:
import re, regex, timeit
text='your_long_string'
re_pattern=re.compile(r'\b\w[\w.-]{0,99}#[\w.-]{1,100}')
regex_pattern=regex.compile(r'\b\w[\w.-]{0,99}#[\w.-]{1,100}')
timeit.timeit("p.findall(text)", 'from __main__ import text, re_pattern as p', number=100000)
# => 6034.659449000001
timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern as p', number=100000)
# => 218.1561693
Don't use regex on the whole string. Regex are slow. Avoiding them is your best bet to better overall performance.
My first approach would look like this:
Split the string on spaces.
Filter the result down to the parts that contain #.
Create a pre-compiled regex.
Use regex on the remaining parts only to remove false positives.
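The four steps above can be sketched as follows. This assumes addresses are delimited by whitespace and uses the '#' separator exactly as it appears in the question's sample; ADDRESS_RE and find_addresses are names introduced here for illustration.

```python
import re

# Pre-compiled once; the character classes mirror the pattern from the question.
ADDRESS_RE = re.compile(r'[\w.-]{1,100}#[\w.-]{1,100}')

def find_addresses(text):
    """Split on whitespace, keep parts containing '#', then confirm with the regex."""
    found = []
    for part in text.split():
        if '#' not in part:          # cheap substring test filters most parts out
            continue
        match = ADDRESS_RE.search(part)
        if match:                    # the regex only removes false positives
            found.append(match.group(0))
    return found

if __name__ == '__main__':
    line = ('hej34g934gj93gh398gie foo#bar.com e34y9u394y3h4jhhrjg '
            'bar#foo.com hge98gej9rg938h9g34gug')
    print(find_addresses(line))
```

The regex now only ever sees short candidate strings, so its cost no longer depends on the length of the whole input.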
Another idea:
in a loop....
use .index("#") to find the position of the next candidate
extend e.g. 100 characters to the left, 50 to the right to cover name and domain
adapt the range depending on the last email address you found so you don't overlap
check the range with a regex, if it matches, yield the match
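A rough sketch of this second idea follows. It uses str.find rather than .index so that running out of candidates returns -1 instead of raising; the window sizes, the '#' separator (taken from the question's sample) and the name iter_addresses are assumptions made for illustration.

```python
import re

ADDRESS_RE = re.compile(r'[\w.-]{1,100}#[\w.-]{1,100}')

def iter_addresses(text, left=100, right=50):
    """Yield addresses by checking a small window around each '#'."""
    last_end = 0        # end of the previous match, so windows never overlap it
    search_from = 0
    while True:
        sep = text.find('#', search_from)
        if sep == -1:
            return
        start = max(last_end, sep - left)
        end = min(len(text), sep + right + 1)
        search_from = sep + 1
        # Only accept a match in the window that actually covers this '#'.
        for m in ADDRESS_RE.finditer(text[start:end]):
            if m.start() <= sep - start < m.end():
                yield m.group(0)
                last_end = start + m.end()
                search_from = max(search_from, last_end)
                break

if __name__ == '__main__':
    line = ('hej34g934gj93gh398gie foo#bar.com e34y9u394y3h4jhhrjg '
            'bar#foo.com hge98gej9rg938h9g34gug')
    print(list(iter_addresses(line)))
```

Note that a window of 50 characters to the right truncates longer domains, so the limits should be adapted to the data.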
I am looking for help in making an efficient way to process some high throughput DNA sequencing data.
The data are in 5 files with a few hundred thousand sequences each, within which each sequence is formatted as follows:
@M01102:307:000000000-BCYH3:1:1102:19202:1786 1:N:0:TAGAGGCA+CTCTCTCT
TAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGGTAAGAAGCCTTTTTGTCTGCTTTATGGTCCTATCTGCGGCAGGGCCAGCGGCAGCTAGGACGGGGGGCGGATAAGATCGGAAGAGCACTCGTCTGAACTCCAGTCACTAGAGGCAATCTCGT
+
AAABBFAABBBFGGGGFGGGGGAG5GHHHCH54BEEEEA5GGHDHHHH5BAE5DF5GGCEB33AF3313GHHHE255D55D55D53#5#B5DBD5#E/#//>/1??/?/E#///FDF0B?CC??CAAA;--./;/BBE?;AFFA./;/;.;AEA//BFFFF/BB/////;/..:.9999.;
What I am doing at the moment is iterating over the lines, checking if the first and last letter is an allowed character for a DNA sequence (A/C/G/T or N), then doing a fuzzy search for the two primer sequences that flank the coding sequence fragment I am interested in. This last step is the part where things are going wrong...
When I search for exact matches, I get useable data in a reasonable time frame. However, I know I am missing out on a lot of data that is being skipped because of a single mis-match in the primer sequences. This happens because read quality degrades with length, and so more unreadable bases ('N') crop up. These aren't a problem in my analysis otherwise, but are a problem with a simple direct string search approach -- N should be allowed to match with anything from a DNA perspective, but is not from a string search perspective (I am less concerned about insertion or deletions). For this reason I am trying to implement some sort of fuzzy or more biologically informed search approach, but have yet to find an efficient way of doing it.
What I have now does work on test datasets, but is much too slow to be useful on a full size real dataset. The relevant fragment of the code is:
from Bio import pairwise2
Sequence = 'NNNNNTAATACGACTCACTATAGGGTTAACTTTAAGAGGGAGATATACATATGAGTCTTTTGGGTAAGAAGCCTTTTTGTCTGCTTTATGGTCCTATCTGCGGCAGGGCCAGCGGCAGCTAGGACGGGGGGCGGATAAGATCGGAAGAGCACTCGTCTGAACTCCAGTCACTAGAGGCAATCTCGT'
fwdprimer = 'TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG'
revprimer = 'TAGGACGGGGGGCGGAAA'
if Sequence.endswith(('N','A','C','G','T')) and Sequence.startswith(('N','A','C','G','T')):
    fwdalign = pairwise2.align.localxs(Sequence,fwdprimer,-1,-1, one_alignment_only=1)
    revalign = pairwise2.align.localxs(Sequence,revprimer,-1,-1, one_alignment_only=1)
    if fwdalign[0][2]>45 and revalign[0][2]>15:
        startIndex = fwdalign[0][3]+45
        endIndex = revalign[0][3]+3
        Sequence = Sequence[startIndex:endIndex]
        print(Sequence)
(obviously the first conditional is not needed in this example, but helps to filter out the other 3/4 of the lines that don't have DNA sequence and so don't need to be searched)
This approach uses the pairwise alignment method from biopython, which is designed for finding alignments of DNA sequences with mismatches allowed. That part it does well, but because it needs to do a sequence alignment for each sequence with both primers it takes way too long to be practical. All I need it to do is find the matching sequence, allowing for one or two mismatches. Is there another way of doing this that would serve my goals but be computationally more feasible? For comparison, the following code from a previous version works plenty fast with my full data sets:
if ('TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG' in Line) and ('TAGGACGGGGGGCGGAAA' in Line):
    startIndex = Line.find('TAATACGACTCACTATAGGGTTAACTTTAAGAAGGAGATATACATATG')+45
    endIndex = Line.find('TAGGACGGGGGGCGGAAA')+3
    Line = Line[startIndex:endIndex]
    print(Line)
This is not something I run frequently, so don't mind if it is a little inefficient, but don't want to have to leave it running for a whole day. I would like to get a result in seconds or minutes, not hours.
The tre library provides fast approximate matching functions. You can specify the maximum number of mismatched characters with maxerr as in the example below:
https://github.com/laurikari/tre/blob/master/python/example.py
There is also the regex module, which supports fuzzy searching options: https://pypi.org/project/regex/#additional-features
In addition, you can use a simple regular expression that allows alternate characters, as in:
import re
# Allow any position to be read as N
pattern = re.compile('[TN][AN][AN][TN]')
if pattern.match('TANN'):
    print('found')
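If neither library is available, a dependency-free sliding-window comparison is another option. The sketch below is not the tre or regex API: it is a plain Hamming-distance scan, written under the assumption that an N in the read matches any base and that insertions and deletions can be ignored, as the question states. The function name and the max_mismatches parameter are introduced here for illustration.

```python
def find_with_mismatches(sequence, primer, max_mismatches=1):
    """Return the first index where primer fits sequence with at most
    max_mismatches substitutions, or -1 if there is no such position.
    An 'N' in the sequence is treated as matching any base."""
    n = len(primer)
    for start in range(len(sequence) - n + 1):
        mismatches = 0
        for s, p in zip(sequence[start:start + n], primer):
            if s != p and s != 'N':
                mismatches += 1
                if mismatches > max_mismatches:
                    break            # too many errors, try the next offset
        else:
            return start             # inner loop finished without breaking
    return -1

if __name__ == '__main__':
    print(find_with_mismatches('NNNNNTAATACG', 'TAATACG', 0))
    print(find_with_mismatches('ACGTTTT', 'CGA', 1))
```

This is O(len(sequence) * len(primer)) in the worst case, but the early break keeps most offsets cheap, and it avoids running a full alignment for every read.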
I have several million strings, X, each with fewer than 20 or so words. I also have a list of several thousand candidate substrings, C. For each x in X, I want to see if there are any strings in C that are contained in x. Right now I am using a naive double for loop, but it's been a while and it hasn't finished yet... Any suggestions? I'm using Python, if anyone knows of a nice implementation, but links for any language or general algorithms would be nice too.
Encode one of your sets of strings as a trie (I recommend the bigger set). Lookup time should be faster than an imperfect hash and you will save some memory too.
It's gonna be a long while. You have to check every one of those several million strings against every one of those several thousand candidate substrings, meaning that you will be doing (several million * several thousand) string comparisons. Yeah, that will take a while.
If this is something that you're only going to do once or infrequently, I would suggest using fgrep. If this is something that you're going to do often, then you want to look into implementing something like the Aho-Corasick string matching algorithm.
If your x in X only contains words, and you only want to match whole words, you could do the following:
Insert your keywords into a set, which makes each lookup O(1) on average, and then check for every word in x whether it is contained in that set.
Like:
keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split(' '):
        if w in keywords:
            pass # do what you need to do
A good alternative would be to use Google's RE2 library, which uses automata theory to produce efficient matchers. (http://code.google.com/p/re2/)
EDIT: Be sure you use proper buffering and something in a compiled language; that makes it a lot faster. If it's less than a couple of gigabytes, it should work with Python too.
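You could try to use a regex (escape the candidates so that any special characters in them are matched literally):
import re
subs = re.compile('|'.join(map(re.escape, C)))
for x in X:
    if subs.search(x):
        print('found')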
Have a look at http://en.wikipedia.org/wiki/Aho-Corasick. You can build a pattern-matcher for a set of fixed strings in time linear in the total size of the strings, then search in text, or multiple sections of text, in time linear in the length of the text + the number of matches found.
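For readers who want to see the shape of the algorithm, here is a compact pure-Python sketch of Aho-Corasick. It is a teaching version, not a tuned implementation; the names build_automaton and find_all are introduced here, and a C extension would be considerably faster on millions of strings.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and outputs (out)."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits

if __name__ == '__main__':
    automaton = build_automaton(['he', 'she', 'his', 'hers'])
    print(find_all('ushers', automaton))
```

The automaton is built once from C and then reused for every x in X, which is where the saving over the double loop comes from.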
Another fast exact pattern matcher is http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
I'm writing a python program that deals with a fair amount of strings/files. My problem is that I'm going to be presented with a fairly short piece of text, and I'm going to need to search it for instances of a fairly broad range of words/phrases.
I'm thinking I'll need to compile regular expressions as a way of matching these words/phrases in the text. My concern, however, is that this will take a lot of time.
My question is how fast is the process of repeatedly compiling regular expressions, and then searching through a small body of text to find matches? Would I be better off using some string method?
Edit: So, I guess an example of my question would be: How expensive would it be to compile and search with one regular expression versus say, iterating 'if "word" in string' say, 5 times?
You should try to compile all your regexps into a single one using the | operator. That way, the regexp engine will do most of the optimizations for you. Use the grouping operator () to determine which regexp matched.
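A minimal sketch of that idea, using named groups and the match object's lastgroup attribute to report which alternative matched. The three sub-patterns and the classify helper are made up for illustration.

```python
import re

# Hypothetical sub-patterns; each becomes one named group in the combined regex.
PATTERNS = {
    'number': r'\d+',
    'word': r'[A-Za-z]+',
    'space': r'\s+',
}

COMBINED = re.compile(
    '|'.join('(?P<%s>%s)' % (name, pat) for name, pat in PATTERNS.items())
)

def classify(text):
    """Return (group_name, matched_text) pairs for each match in text."""
    # lastgroup is the name of the last capturing group that took part in the match.
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]

if __name__ == '__main__':
    print(classify('abc 42'))
```

Alternatives are tried left to right, so when two sub-patterns can match at the same position, the one listed first wins.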
If speed is of the essence, you are better off running some tests before you decide how to code your production application.
First of all, you said that you are searching for words which suggests that you may be able to do this using split() to break up the string on whitespace. And then use simple string comparisons to do your search.
Definitely do compile your regular expressions and do a timing test comparing that with the plain string functions. Check the documentation for the string class for a full list.
Your requirement appears to be searching a text for the first occurrence of any one of a collection of strings. Presumably you then wish to restart the search to find the next occurrence, and so on until the searched string is exhausted. Only plain old string comparison is involved.
The classic algorithm for this task is Aho-Corasick for which there is a Python extension (written in C). This should beat the socks off any alternative that's using the re module.
If you want to know how fast compiling regex patterns is, you need to benchmark it.
Here is how I do that. It compiles each pattern 1 million times.
import time, re
def taken(f):
    # Decorator that prints how long the wrapped call took
    def wrap(*arg):
        t1, r, t2 = time.time(), f(*arg), time.time()
        print(t2 - t1, "s taken")
        return r
    return wrap
@taken
def regex_compile_test(x):
    for i in range(1000000):
        re.compile(x)
    print("for", x, end=" ")
# sample tests
regex_compile_test("a")
regex_compile_test("[a-z]")
regex_compile_test(r"[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}")
It took around 5 seconds for each pattern on my computer.
for a 4.88999986649 s taken
for [a-z] 4.70300006866 s taken
for [A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4} 4.78200006485 s taken
The real bottleneck is not in compiling patterns; it's in extracting text with calls like re.findall or replacing with re.sub. If you run those against several MB of text, it's quite slow.
If your text is fixed, use plain str.find; it's faster than regex.
Actually, if you give us samples of your text and your regex patterns, we could give you a better idea; there are many great regex and Python people out there.
Hope this helps; sorry if my answer couldn't help you.
When you compile the regexp, it is converted into a state machine representation. Provided the regexp is efficiently expressed, it should still be very fast to match. Compiling the regexp can be expensive though, so you will want to do that up front, and as infrequently as possible. Ultimately though, only you can answer if it is fast enough for your requirements.
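One detail worth checking when measuring compile cost: the re module keeps an internal cache of recently compiled patterns, so calling re.compile repeatedly on the same string mostly measures a cache lookup. The sketch below compares a cached compile with a cold one by calling re.purge() first. The pattern is borrowed from the benchmark above, the function names are introduced here, and the timings are machine-dependent.

```python
import re
import timeit

PATTERN = r"[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}"

def cached_compile():
    """Compile a pattern that is (after the first call) already in re's cache."""
    return re.compile(PATTERN)

def cold_compile():
    """Clear re's cache first, forcing a real compilation each time."""
    re.purge()
    return re.compile(PATTERN)

if __name__ == "__main__":
    print("cached:", timeit.timeit(cached_compile, number=1000))
    print("cold:  ", timeit.timeit(cold_compile, number=1000))
```

Either way, compiling once at module level and reusing the pattern object keeps this cost out of the hot loop.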
There are other string searching approaches, such as the Boyer-Moore algorithm. But I'd wager the complexity of searching for multiple separate strings is much higher than a regexp that can switch off each successive character.
This is a question that can readily be answered by just trying it.
>>> import re
>>> import timeit
>>> find = ['foo', 'bar', 'baz']
>>> pattern = re.compile("|".join(find))
>>> with open('c:\\temp\\words.txt', 'r') as f:
...     words = f.readlines()
...
>>> len(words)
235882
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if w.find(s) >= 0), words)', 'from __main__ import find, words', number=30)
18.404569854548527
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if s in w), words)', 'from __main__ import find, words', number=30)
10.953313759150944
>>> timeit.timeit('r = filter(lambda w: pattern.search(w), words)', 'from __main__ import pattern, words', number=30)
6.8793022576891758
It looks like you can reasonably expect regular expressions to be faster than using find or in. Though if I were you I'd repeat this test with a case that was more like your real data.
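The session above was recorded on Python 2, where filter returns a list. On Python 3 filter is lazy, so the same statements would time almost nothing. A Python 3 sketch of the same comparison follows, with a small generated word list standing in for words.txt (an assumption; substitute real data for meaningful numbers).

```python
import re
import timeit

find = ['foo', 'bar', 'baz']
pattern = re.compile('|'.join(map(re.escape, find)))

# Stand-in data; the original test read 235882 lines from a word list.
words = ['football', 'rebar', 'bazaar', 'python', 'regex', 'string'] * 1000

def with_find():
    return [w for w in words if any(w.find(s) >= 0 for s in find)]

def with_in():
    return [w for w in words if any(s in w for s in find)]

def with_regex():
    return [w for w in words if pattern.search(w)]

if __name__ == '__main__':
    # List comprehensions force evaluation, so the work is actually measured.
    for fn in (with_find, with_in, with_regex):
        print(fn.__name__, timeit.timeit(fn, number=30))
```

Which approach wins can depend on the number of search strings and on how often they occur, so the ranking from the original session should not be assumed to carry over.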
If you're just searching for a particular substring, use str.find() instead.
Depending on what you're doing it might be better to use a tokenizer and loop through the tokens to find matches.
However, when it comes to short pieces of text, regexes have incredibly good performance. Personally, I only remember running into problems when text sizes became ridiculous, like 100k words or so.
Furthermore, if you are worried about the speed of actual regex compilation rather than matching, you might benefit from creating a daemon that compiles all the regexes then goes through all the pieces of text in a big loop or runs as a service. This way you will only have to compile the regexes once.
In the general case, you can use the "in" keyword:
for line in open("file"):
    if "word" in line:
        print(line.rstrip())
Regex is usually not needed when you use Python :)