I need to efficiently count Python regex matches. findall and finditer work, but they are slow for a large number of searches.
So far I have this (much simplified version):
import re

def count_matches(stringData):
    testItem = re.compile(r"var")
    counter = 0
    for _ in testItem.finditer(stringData):
        counter = counter + 1
    return counter
I am running multiple instances of this method. I do not care about the matches themselves; I simply wish to return the counter.
The issue is that stringData is a very large string. The regex itself is pretty simple.
Please advise on a more efficient way to do this.
Thanks in advance.
Since you show in your edit that you're just looking for a substring,
stringData.count('var')
should serve you well.
Of course, this does not generalize to many other uses of REs! Unfortunately, at least as of Python 3.4, re.finditer returns an iterator which does not support the "length hint" formalized by PEP 424, so there aren't many good alternatives (for the general case) to
sum(1 for _ in testItem.finditer(stringData))
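If you want to check the difference on your own data, here is a rough timing sketch (the sample string below is made up, so the numbers will differ from yours):
import re
import timeit

stringData = "var x = 1; var y = x + 2; " * 10000   # placeholder for the real data
testItem = re.compile(r"var")

print(timeit.timeit(lambda: stringData.count('var'), number=100))
print(timeit.timeit(lambda: sum(1 for _ in testItem.finditer(stringData)), number=100))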
Related
I have a massive string. It looks something like this:
hej34g934gj93gh398gie foo@bar.com e34y9u394y3h4jhhrjg bar@foo.com hge98gej9rg938h9g34gug
Except that it's much longer (1,000,000+ characters).
My goal is to find all the email addresses in this string.
I've tried a number of solutions, including this one:
# matches foo@bar.com and bar@foo.com
re.findall(r'[\w\.-]{1,100}@[\w\.-]{1,100}', line)
Although the above code technically works, it takes an insane amount of time to execute. I'm not sure if it counts as catastrophic backtracking or if it's just really inefficient, but whatever the case, it's not good enough for my use case.
I suspect that there's a better way to do this. For example, if I use this regex to only search for the latter part of the email addresses:
# matches @bar.com and @foo.com
re.findall(r'@[\w-]{1,256}[\.]{1}[a-z.]{1,64}', line)
It executes in just a few milliseconds.
I'm not familiar enough with regex to write the rest, but I assume that there's some way to find the @x.x part first and then check the first part afterwards? If so, then I'm guessing that would be a lot quicker.
You can use the PyPI regex module by Matthew Barnett, which is much more powerful and stable when it comes to parsing long texts. This regex library has some basic checks for pathological cases implemented. The library author mentions in his post:
The internal engine no longer interprets a form of bytecode but
instead follows a linked set of nodes, and it can work breadth-wise as
well as depth-first, which makes it perform much better when faced
with one of those 'pathological' regexes.
However, there is yet another trick you may implement in your regex: Python re (and regex, too) optimizes matching at word boundary locations. Thus, if your pattern is supposed to match at a word boundary, always start your pattern with it. In your case, r'\b[\w.-]{1,100}@[\w.-]{1,100}' or r'\b\w[\w.-]{0,99}@[\w.-]{1,100}' should also work much better than the original pattern without a word boundary.
Python test:
import re, regex, timeit
text = 'your_long_string'
re_pattern = re.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
regex_pattern = regex.compile(r'\b\w[\w.-]{0,99}@[\w.-]{1,100}')
timeit.timeit("p.findall(text)", 'from __main__ import text, re_pattern as p', number=100000)
# => 6034.659449000001
timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern as p', number=100000)
# => 218.1561693
Don't use regex on the whole string. Regexes are slow. Avoiding them is your best bet for better overall performance.
My first approach would look like this (sketched after the list):
Split the string on spaces.
Filter the result down to the parts that contain @.
Create a pre-compiled regex.
Use regex on the remaining parts only to remove false positives.
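Here is a minimal sketch of that first approach; the pattern and the helper name are illustrative, not from the question:
import re

# Hypothetical "full" pattern; only applied to chunks that already look promising.
email_re = re.compile(r'[\w.-]{1,100}@[\w.-]{1,100}')

def find_addresses(text):
    for chunk in text.split():          # 1. split on whitespace
        if '@' in chunk:                # 2. cheap pre-filter
            m = email_re.search(chunk)  # 3./4. precompiled regex removes false positives
            if m:
                yield m.group(0)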
Another idea (sketch after the list below):
in a loop....
use .index("#") to find the position of the next candidate
extend e.g. 100 characters to the left, 50 to the right to cover name and domain
adapt the range depending on the last email address you found so you don't overlap
check the range with a regex, if it matches, yield the match
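And a sketch of that second idea, with made-up window sizes that you would tune to your data:
import re

email_re = re.compile(r'[\w.-]{1,100}@[\w.-]{1,100}')   # hypothetical pattern

def find_addresses_by_index(text):
    pos = 0
    while True:
        try:
            at = text.index('@', pos)            # next candidate
        except ValueError:
            return
        start = max(at - 100, 0)                 # extend left for the name...
        window = text[start:at + 50]             # ...and right for the domain
        m = email_re.search(window)
        if m:
            yield m.group(0)
            pos = max(start + m.end(), at + 1)   # don't overlap the address just found
        else:
            pos = at + 1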
I have been working on a program which requires the counting of sub-strings (up to 4000 sub-strings of 2-6 characters located in a list) inside a main string (~400,000 characters). I understand this is similar to the question asked at Counting substrings in a string; however, that solution does not work for me. Since my sub-strings are DNA sequences, many of my sub-strings are repetitive instances of a single character (e.g. 'AA'); therefore, 'AAA' will be interpreted as a single instance of 'AA' rather than two instances if I split the string on 'AA'. My current solution uses nested loops, but I'm hoping there is a faster way, as this code takes 5+ minutes for a single main string. Thanks in advance!
from itertools import product   # needed for the k-mer enumeration below

def getKmers(self, kmer):
    self.kmer_dict = {}
    # Build every possible k-mer of the requested length.
    kmer_tuples = list(product(['A', 'C', 'G', 'T'], repeat=kmer))
    kmer_list = []
    for x in range(len(kmer_tuples)):
        new_kmer = ''
        for y in range(kmer):
            new_kmer += kmer_tuples[x][y]
        kmer_list.append(new_kmer)
    for x in range(len(kmer_list)):
        self.kmer_dict[kmer_list[x]] = 0
    # Slide a window over the sequence and compare it against every k-mer.
    for x in range(len(self.sequence) - kmer + 1):   # + 1 so the final window is counted
        for substr in kmer_list:
            if self.sequence[x:x + kmer] == substr:
                self.kmer_dict[substr] += 1
                break
    return self.kmer_dict
For counting overlapping substrings of DNA, you can use Biopython:
>>> from Bio.Seq import Seq
>>> Seq('AAA').count_overlap('AA')
2
Disclaimer: I wrote this method, see commit 97709cc.
However, if you're looking for really high performance, Python probably isn't the right language choice (although an extension like Cython could help).
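If you'd rather stay in the standard library, counting overlapping occurrences of a single substring can also be done with a small str.find loop (a sketch; it gives the same answer as the Biopython call above for this example):
def count_overlapping(haystack, needle):
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        start = haystack.find(needle, start + 1)   # step by one so overlaps are counted
    return count

print(count_overlapping('AAA', 'AA'))   # 2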
Of course Python is fully able to perform these string searches. But instead of re-inventing all the wheels you will need, one screw at a time, you would be better off using a more specialized tool within Python to deal with your problem - the BioPython project looks like the most actively maintained and complete option for this sort of work.
Short post with an example resembling your problem:
https://dodona.ugent.be/nl/exercises/1377336647/
Link to the BioPython project documentation: https://biopython.org/wiki/Documentation
(if the problem were simply overlapping strings, then the third-party "regex" module would be the way to go - https://pypi.org/project/regex/ - as the built-in engine in Python's re module can't deal with overlapping sequences either)
Seems like a simple thing but I'm not seeing it. How do I start the search in the middle of a string?
The re.search function doesn't take a start argument the way the str methods do. But the search method of a compiled pattern (the object returned by re.compile) does take a pos argument.
This makes sense if you think about it. If you really need to use the same regular expressions over and over, you probably should be compiling them. Not so much for efficiency—the cache works nicely for most applications—but just for readability.
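For example (the pattern, string, and offsets here are just for illustration):
import re

pattern = re.compile(r'abc')
s = 'abcdefabcdef'

m = pattern.search(s, 3)   # pos=3: start looking at index 3
print(m.start())           # 6 -- the second 'abc'
# Note: unlike slicing, pos does not make '^' match at index 3.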
But what if you need to use the top-level function, because you can't pre-compile your patterns for some reason?
Well, there are plenty of third-party regular expression libraries. Some of these wrap PCRE or Google's RE2 or ICU, some implement regular expressions from scratch, and they all have at least slightly different, sometimes radically different, APIs.
But the regex module, which is being designed as an eventual replacement for re in the stdlib (although it's been bumped a couple of times now because it's not quite ready), is pretty much usable as a drop-in replacement for re, and (among other extensions) it takes pos and endpos arguments on its search function.
Normally, the most common reason you'd want to do this is to "find the next match after the one I just found", and there's a much easier way to do that: use finditer instead of search.
For example, this str-method loop:
i = 0
while True:
    i = s.find(sub, i)
    if i == -1:
        break
    do_stuff_with(s, i)
    i += 1   # advance, or find() keeps returning the same position
… translates to this much nicer regex loop:
for match in re.finditer(pattern, s):
    do_stuff_with(match)
When that isn't appropriate, you can always slice the string:
match = re.search(pattern, s[index:])
But that makes an extra copy of half your string, which could be a problem if string is actually, say, a 12GB mmap. (Of course for the 12GB mmap case, you'd probably want to map a new window… but there are cases where that won't help.)
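For that mmap case specifically, a compiled bytes pattern plus pos can search the buffer in place, with no copy (a sketch; the file name and offset are made up):
import mmap
import re

pattern = re.compile(rb'abc')   # bytes pattern, since mmap is a bytes-like buffer

with open('big.bin', 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    m = pattern.search(mm, 1000000)   # start at offset 1,000,000 without slicing
    if m:
        print(m.start())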
Finally, you can always just modify your pattern to skip over index characters:
match = re.search('.{%d}%s' % (index, pattern), s)
All I've done here is to add, e.g., .{20} to the start of the pattern, which means to match exactly 20 of any character, plus whatever else you were trying to match. Here's a simple example:
.{3}(abc)
Debuggex Demo
If I give this abcdefabcdef, it will match the first 'abc' after the 3rd character—that is, the second abc.
But notice that what it actually matches is 'defabc'. Because I'm using capture groups for my real pattern, and I'm not putting the .{3} in a group, match.group(1) and so on will work exactly as I'd want them to, but match.group(0) will give me the wrong thing. If that matters, you need lookbehind.
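A quick sketch of that lookbehind fix (Python's re only allows fixed-width lookbehind, which .{3} satisfies):
import re

s = 'abcdefabcdef'

m = re.search(r'.{3}(abc)', s)
print(m.group(0), m.group(1))   # 'defabc' 'abc'

m = re.search(r'(?<=.{3})abc', s)
print(m.group(0), m.start())    # 'abc' 6 -- group(0) is now just the real match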
My Python script reads each line in a file and does many regex replacements on each line.
If a regex succeeds, skip to the next line.
Is there any way to speed up this kind of script?
Is it worth calling subn instead, checking whether a replacement was done, and then skipping to the next line?
If I compile the regexes, is it possible to keep all the compiled regexes in memory?
for file in files:
    for line in file:
        re.sub()  # <--- ~ 100 re.sub
PS: the replacement varies for each regex
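To make the setup concrete, here is a minimal sketch of that loop with precompiled pattern/replacement pairs (the rules themselves are placeholders):
import re

# Compiled once; the ~100 real (pattern, replacement) pairs would go here.
RULES = [
    (re.compile(r'foo'), 'FOO'),
    (re.compile(r'ba+r'), 'BAR'),
]

def process_line(line):
    for pattern, replacement in RULES:
        new_line, n = pattern.subn(replacement, line)
        if n:              # a substitution happened: skip the remaining rules
            return new_line
    return line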
You should probably do three things:
Reduce the number of regexes. Depending on differences in the substitution part, you might be able to combine them all into a single one. Using careful alternation, you can determine the sequence in which parts of the regex will be matched.
If possible (depending on file size), read the file into memory completely.
Compile your regex (only for readability; it won't matter in terms of speed as long as the number of regexes stays below 100).
This gives you something like:
regex = re.compile(r"My big honking regex")
for datafile in files:
content = datafile.read()
result = regex.sub("Replacement", content)
As @Tim Pietzcker said, you could reduce the number of regexes by making them alternatives. You can determine which alternative matched by using the 'lastindex' attribute of the match object.
Here's an example of what you could do:
>>> import re
>>> replacements = {1: "<UPPERCASE LETTERS>", 2: "<lowercase letters>", 3: "<Digits>"}
>>> def replace(m):
... return replacements[m.lastindex]
...
>>> re.sub(r"([A-Z]+)|([a-z]+)|([0-9]+)", replace, "ABC def 789")
'<UPPERCASE LETTERS> <lowercase letters> <Digits>'
I'm writing a python program that deals with a fair amount of strings/files. My problem is that I'm going to be presented with a fairly short piece of text, and I'm going to need to search it for instances of a fairly broad range of words/phrases.
I'm thinking I'll need to compile regular expressions as a way of matching these words/phrases in the text. My concern, however, is that this will take a lot of time.
My question is how fast is the process of repeatedly compiling regular expressions, and then searching through a small body of text to find matches? Would I be better off using some string method?
Edit: So, I guess an example of my question would be: how expensive would it be to compile and search with one regular expression versus, say, iterating 'if "word" in string' five times?
You should try to compile all your regexps into a single one using the | operator. That way, the regexp engine will do most of the optimizations for you. Use the grouping operator () to determine which regexp matched.
If speed is of the essence, you are better off running some tests before you decide how to code your production application.
First of all, you said that you are searching for words, which suggests that you may be able to do this using split() to break up the string on whitespace and then use simple string comparisons to do your search.
Definitely do compile your regular expressions and do a timing test comparing that with the plain string functions. Check the documentation for the string class for a full list.
Your requirement appears to be searching a text for the first occurrence of any one of a collection of strings. Presumably you then wish to restart the search to find the next occurrence, and so on until the searched string is exhausted. Only plain old string comparison is involved.
The classic algorithm for this task is Aho-Corasick for which there is a Python extension (written in C). This should beat the socks off any alternative that's using the re module.
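For example, with the pyahocorasick package (one such C extension; the words and text below are just for illustration):
import ahocorasick   # pip install pyahocorasick

words = ['foo', 'bar', 'baz']

automaton = ahocorasick.Automaton()
for word in words:
    automaton.add_word(word, word)   # key and payload are both the word here
automaton.make_automaton()

text = 'a foolish barbarian walks into a bazaar'
for end_index, word in automaton.iter(text):
    print(word, 'ends at index', end_index)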
If you'd like to know how fast regex pattern compilation is, you need to benchmark it.
Here is how I do that: it compiles each pattern 1 million times.
import time, re

def taken(f):
    def wrap(*arg):
        t1, r, t2 = time.time(), f(*arg), time.time()
        print(t2 - t1, "s taken")
        return r
    return wrap

@taken
def regex_compile_test(x):
    for i in range(1000000):
        re.compile(x)
    print("for", x, end=" ")

# sample tests
regex_compile_test("a")
regex_compile_test("[a-z]")
regex_compile_test(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}")
It took around 5 seconds for each pattern on my computer:
for a 4.88999986649 s taken
for [a-z] 4.70300006866 s taken
for [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} 4.78200006485 s taken
The real bottleneck is not in compiling patterns; it is in extracting text, e.g. with re.findall, or replacing with re.sub. Used against several MB of text, that is quite slow.
If your text is fixed, use normal str.find; it is faster than regex.
Actually, if you give us samples of your text and of your regex patterns, we could give you a better idea; there are many great regex and Python people out there.
Hope this helps, and sorry if my answer isn't what you need.
When you compile the regexp, it is converted into a state machine representation. Provided the regexp is efficiently expressed, it should still be very fast to match. Compiling the regexp can be expensive though, so you will want to do that up front, and as infrequently as possible. Ultimately though, only you can answer if it is fast enough for your requirements.
There are other string-searching approaches, such as the Boyer-Moore algorithm. But I'd wager that the complexity of searching for multiple separate strings one at a time is much higher than that of a single regexp that can branch on each successive character.
This is a question that can readily be answered by just trying it.
>>> import re
>>> import timeit
>>> find = ['foo', 'bar', 'baz']
>>> pattern = re.compile("|".join(find))
>>> with open('c:\\temp\\words.txt', 'r') as f:
...     words = f.readlines()
>>> len(words)
235882
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if w.find(s) >= 0), words)', 'from __main__ import find, words', number=30)
18.404569854548527
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if s in w), words)', 'from __main__ import find, words', number=30)
10.953313759150944
>>> timeit.timeit('r = filter(lambda w: pattern.search(w), words)', 'from __main__ import pattern, words', number=30)
6.8793022576891758
It looks like you can reasonably expect regular expressions to be faster than using find or in. Though if I were you I'd repeat this test with a case that was more like your real data.
If you're just searching for a particular substring, use str.find() instead.
Depending on what you're doing it might be better to use a tokenizer and loop through the tokens to find matches.
However, when it comes to short pieces of text, regexes have incredibly good performance. Personally, I remember only running into problems when text sizes became ridiculous, like 100k words or something like that.
Furthermore, if you are worried about the speed of actual regex compilation rather than matching, you might benefit from creating a daemon that compiles all the regexes then goes through all the pieces of text in a big loop or runs as a service. This way you will only have to compile the regexes once.
In the general case, you can use the "in" keyword:
for line in open("file"):
if "word" in line:
print line.rstrip()
regex is usually not needed when you use Python :)