Speed up a series of regex replacements in Python

My Python script reads each line of a file and performs many regex replacements on each line.
If a regex matches, skip to the next line.
Is there any way to speed up this kind of script?
Is it worth calling subn instead, checking whether a replacement was made, and then skipping the remaining ones?
If I compile the regexes, is it possible to keep all the compiled regexes in memory?
for file in files:
    for line in file:
        re.sub()  # <--- ~100 re.sub calls
PS: the replacement varies for each regex

You should probably do three things:
Reduce the number of regexes. Depending on differences in the substitution part, you might be able to combine them all into a single one. Using careful alternation, you can determine the sequence in which parts of the regex will be matched.
If possible (depending on file size), read the file into memory completely.
Compile your regex (only for readability; it won't matter in terms of speed as long as the number of regexes stays below 100).
This gives you something like:
regex = re.compile(r"My big honking regex")
for datafile in files:
    content = datafile.read()
    result = regex.sub("Replacement", content)
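If you do keep separate patterns, the per-line approach from the question can still be tidied up: compile every pattern once, keep them in a list, and use subn to detect whether a replacement fired so you can skip the remaining patterns for that line. A minimal sketch, with hypothetical pattern/replacement pairs standing in for the real ~100:
import re

# Hypothetical (pattern, replacement) pairs; compiled once, reused for every line.
rules = [
    (re.compile(r"foo"), "FOO"),
    (re.compile(r"ba[rz]"), "B"),
]

def process(lines):
    for line in lines:
        for pattern, replacement in rules:
            line, count = pattern.subn(replacement, line)
            if count:   # a replacement was made,
                break   # so skip the remaining patterns for this line
        yield line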

As @Tim Pietzcker said, you could reduce the number of regexes by making them alternatives. You can determine which alternative matched by using the 'lastindex' attribute of the match object.
Here's an example of what you could do:
>>> import re
>>> replacements = {1: "<UPPERCASE LETTERS>", 2: "<lowercase letters>", 3: "<Digits>"}
>>> def replace(m):
...     return replacements[m.lastindex]
...
>>> re.sub(r"([A-Z]+)|([a-z]+)|([0-9]+)", replace, "ABC def 789")
'<UPPERCASE LETTERS> <lowercase letters> <Digits>'

Related

Regex taking too much time

I created two regexes (re1 and re2). If I compile the first regex (re1), it takes about 30 seconds to find all matches, and if I compile the second regex (re2), it takes about 1 second to find all matches.
Can you help me find the difference, or what is causing this problem?
Thanks!
import re
data = b'000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
re1 = b'.*00.*00.*00.*00.*76.*62.*55.*75.*'
re2 = b'.*aa.*00.*00.*6f.*63.*20.*6d.*75.*'
reg = re.compile(b'^%s$' % re1, re.RegexFlag.M)
results = len(reg.findall(data))
print(results)
The problem comes from backtracking in the implementation of the CPython regexp engine (as emphasized by @Thefourthbird). More specifically, it comes from the first and the last .*, which are not needed if data does not contain newline characters. Indeed, in that case, findall will either find exactly one match (all of data, due to the .*) or nothing, so you do not need findall: a search is enough. Moreover, wrapping the pattern in ^ and $ on top of the leading and trailing .* is not useful either. The following code should produce the same result but is 20 times faster on my machine (still not very fast given the input size).
import re
data = b'000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
re1 = b'00.*00.*00.*00.*76.*62.*55.*75'
re2 = b'aa.*00.*00.*6f.*63.*20.*6d.*75'
reg = re.compile(re1, re.RegexFlag.M)
results = 1 if reg.search(data) else 0
print(results)
If data does contain newline characters, things are a bit more complex, as only the line containing the pattern will be matched. Indeed, . does not match a newline character (since RegexFlag.DOTALL is not set). One solution is to split the data into lines first and then apply the regexp to each line, as sketched below. Another is to keep the code above but replace the search call with finditer to track the locations of the matches, and then find the beginning and end of the line for each match.
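A minimal sketch of the line-splitting approach, reusing the simplified pattern from above and assuming the input is a byte string as in the question:
import re

data = b'...'  # placeholder for the byte string from the question, possibly containing b'\n' this time
reg = re.compile(b'00.*00.*00.*00.*76.*62.*55.*75')

# Count how many lines contain the pattern.
results = sum(1 for line in data.split(b'\n') if reg.search(line))
print(results)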
If you want to know more about why backtracking causes such a slow execution you can look at this post: Why can regular expressions have an exponential running time?.
Note that re1 is a pathological regexp for this data. There are much faster regexp engines for such cases. Google's RE2 regexp engine is a linear-time engine that should be much faster in pathological cases like this one (though generally not faster on ordinary input). There is also Intel's Hyperscan regexp engine, which is generally very fast compared to other engines (although a bit less user-friendly).
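For illustration, a minimal sketch of trying the pattern under RE2, assuming the google-re2 Python bindings (which expose an re-like module named re2; the exact package and API are an assumption here, and the data is simplified to a text string):
import re2  # assumption: the google-re2 bindings, installed separately

data = '0' * 110
re1 = '00.*00.*00.*00.*76.*62.*55.*75'

reg = re2.compile(re1)               # RE2 matches in linear time, no backtracking
print(1 if reg.search(data) else 0)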

Pushing data into regex?

I am working on a small project which I call pydlp. It is basically a set of regex signatures that extract data from a file object, plus a function that checks whether the extracted data is in fact interesting.
This code is how I perform matching. It is far from optimal, as I have to read the file over and over again:
for signature in signatures:
    match = signature.validate(signature.regex.match(fobj.read()))
    if match: matches.append(match)
    fobj.seek(0)
Is there a way to perform multiple regex matches on the same file object while only reading its content once? The file can be large, so I cannot put it all in memory.
Edit:
I want to clarify what I mean by "pushing data into regex". I recognize that a regex is similar to a finite state machine. Instead of passing all the data to the regex engine at once, is it possible to push parts of it at a time?
while True:
    data = fobj.read(1024)
    if data == "": break
    for signature in signatures:
        match = signature.regex.push_and_match(data)
        if match: matches.append(match)
Edit 2:
Removed link, as I removed the project from github.
The standard way to do this sort of text processing with files too large to read into memory is to iterate over the file line by line:
regexes = [ .... ]

with open('large.file.txt') as fh:
    for line in fh:
        for rgx in regexes:
            m = rgx.search(line)
            if m:
                # Do stuff.
But that approach assumes your regexes can operate successfully on single lines of text in isolation. If they cannot, perhaps there are other units that you can pass to the regexes (eg, paragraphs delimited by blank lines). In other words, you might need to do a bit of pre-parsing so that you can grab meaningful sections of text before sending them to your main regexes.
with open('large.file.txt') as fh:
    section = []
    for line in fh:
        if line.strip():
            section.append(line)
        else:
            # We've hit the end of a section, so we
            # should check it against our regexes.
            process_section(''.join(section), regexes)
            section = []
    # Don't forget the last one.
    if section:
        process_section(''.join(section), regexes)
Regarding your literal question: "Is there a way to perform multiple regex matches on the same file object while only reading the file object content once". No and yes. No in the sense that Python regexes operate on strings, not file objects. But you can perform multiple regex searches at the same time on one string, simply by using alternation. Here's a minimal example:
import re

patterns = 'aa bb cc'.split()
big_regex = re.compile('|'.join(patterns))  # Match this or that or that.
m = big_regex.search(some_text)
But that doesn't really solve your problem if the file is too big for memory.
Maybe consider using re.findall() if you only need the matched strings rather than match objects? If the file is too big, you can read it in chunks, as you suggest, but with some overlap between chunks so that no matches are missed (if you know the nature of the regexes, it may be possible to work out how big the overlap needs to be); see the sketch below.
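A minimal sketch of that chunked scan, assuming you can bound the longest possible match by some maximum length (scan_chunks, chunk_size and max_match_len are hypothetical names here); matches found in the overlap region are deferred to the next round so the same match is never reported twice:
import re

def scan_chunks(fobj, pattern, chunk_size=1 << 16, max_match_len=256):
    """Yield (absolute_offset, matched_text), assuming no match is longer than max_match_len."""
    regex = re.compile(pattern)
    tail = ''           # overlap carried over from the previous chunk
    tail_start = 0      # absolute offset of the start of `tail`
    while True:
        chunk = fobj.read(chunk_size)
        at_eof = not chunk
        window = tail + chunk
        # Matches starting in the last max_match_len characters may be cut off
        # by the chunk boundary, so defer them to the next round (unless at EOF).
        limit = len(window) if at_eof else len(window) - max_match_len
        for m in regex.finditer(window):
            if m.start() < limit:
                yield tail_start + m.start(), m.group(0)
        if at_eof:
            break
        tail = window[-max_match_len:]
        tail_start += len(window) - len(tail)

Usage would be something like: for offset, text in scan_chunks(open('large.file.txt'), 'aa|bb|cc'): ...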

Does a fast Python built-in method for reading lines and then splitting them exist?

This method works just fine in Python:
with open(file) as f:
    for line in f:
        for field in line.rstrip().split('\t'):
            continue
However, it also means I effectively go over each line twice: first I loop over the characters of the file looking for newline characters, and then I loop over the characters of each line looking for tab characters. Is there a built-in method for splitting lines that avoids looping over the same set of characters twice? Apologies if this is a stupid question.
If you're worried about this level of efficiency then you probably shouldn't be programming in Python. Most of what happens in that loop happens in C (if you're using the CPython implementation). You're not going to find a more efficient way to process your data with a pure Python approach, at least not without building a very complicated looping structure.
If I wanted to avoid looping over the lines and handle the whole file in one go I would go with a regular expression. Also, regular expressions should be really fast.
import re
regexp = re.compile("\n+")
with open(file) as f:
    lines = re.split(regexp, f.read())
Now \n+ matches one or more newlines and splits the file there. The result is a Python list with all the lines. If you want to split on other characters as well, for example whitespace in general (tabs and newlines included), you would replace \n+ with \s+. Depending on what you want to do with the lines this might not be faster. Timeit is your friend.
More on Python's regexps:
https://docs.python.org/2/library/re.html
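To act on that "Timeit is your friend" advice, here is a minimal sketch comparing the line-by-line split with the whole-file regex split (big.txt is a hypothetical sample file; substitute your own data):
import re
import timeit

with open('big.txt') as f:   # hypothetical sample file
    content = f.read()

def split_by_lines():
    fields = []
    for line in content.splitlines():
        fields.extend(line.split('\t'))
    return fields

def split_by_regex():
    # Note: not strictly equivalent (consecutive separators are collapsed).
    return re.split(r'[\t\n]+', content)

print(timeit.timeit(split_by_lines, number=10))
print(timeit.timeit(split_by_regex, number=10))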

findall/finditer on a stream?

Is there a way to get the re.findall, or better yet, re.finditer functionality applied to a stream (i.e. a filehandle open for reading)?
Note that I am not assuming that the pattern to be matched is fully contained within one line of input (i.e. multi-line patterns are permitted). Nor am I assuming a maximum match length.
It is true that, at this level of generality, it is possible to specify a regex that would require that the regex engine have access to the entire string (e.g. r'(?sm).*'), and, of course, this means having to read the entire file into memory, but I am not concerned with this worst-case scenario at the moment. It is, after all, perfectly possible to write multi-line-matching regular expressions that would not require reading the entire file into memory.
Is it possible to access the underlying automaton (or whatever is used internally) from a compiled regex, to feed it a stream of characters?
Thanks!
Edit: Added clarifications regarding multi-line patterns and match lengths, in response to Tim Pietzcker's and rplnt's answers.
This is possible if you know that a regex match will never span a newline.
Then you can simply do
for line in file:
    result = re.finditer(regex, line)
    # do something...
If matches can extend over multiple lines, you need to read the entire file into memory. Otherwise, how would you know if your match was done already, or if some content further up ahead would make a match impossible, or if a match is only unsuccessful because the file hasn't been read far enough?
Edit:
Theoretically it is possible to do this. The regex engine would have to check whether at any point during the match attempt it reaches the end of the currently read portion of the stream, and if it does, read on ahead (possibly until EOF). But the Python engine doesn't do this.
Edit 2:
I've taken a look at the Python stdlib's re.py and its related modules. The actual generation of a regex object, including its .match() method and others is done in a C extension. So you can't access and monkeypatch it to also handle streams, unless you edit the C sources directly and build your own Python version.
It would be possible to implement for regexps with a known maximum match length: either no +/*, or ones where you know the maximum number of repetitions. If you know this, you can read the file in chunks and match on those, yielding the results. You would also run the regexp on an overlapping chunk, to cover the case where a match would otherwise be cut off by the end of a chunk.
Some pseudo(python)code:
overlap_tail = ''
matched = {}
window_start = 0                     # absolute offset of overlap_tail + chunk
while True:
    chunk = f.read(chunk_size)
    if not chunk:
        break
    window = overlap_tail + chunk
    for result in regex.finditer(window):
        if window_start + result.start() not in matched:
            yield result
            matched[window_start + result.start()] = result
    # (old entries could be dropped from `matched` here to keep it small)
    overlap_tail = window[-max_re_len:]
    window_start += len(window) - len(overlap_tail)
Just an idea, but I hope you get what I'm trying to achieve. You'd also need to handle the end of the file (stream) and a few other edge cases, but I think it can be done (as long as the maximum match length is limited/known).

Speed of many regular expressions in python

I'm writing a python program that deals with a fair amount of strings/files. My problem is that I'm going to be presented with a fairly short piece of text, and I'm going to need to search it for instances of a fairly broad range of words/phrases.
I'm thinking I'll need to compile regular expressions as a way of matching these words/phrases in the text. My concern, however, is that this will take a lot of time.
My question is how fast is the process of repeatedly compiling regular expressions, and then searching through a small body of text to find matches? Would I be better off using some string method?
Edit: So, I guess an example of my question would be: how expensive would it be to compile and search with one regular expression versus, say, iterating 'if "word" in string' five times?
You should try to compile all your regexps into a single one using the | operator. That way, the regexp engine will do most of the optimizations for you. Use the grouping operator () to determine which regexp matched.
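A minimal sketch of that, using named groups so that lastgroup tells you which alternative fired (the word lists here are hypothetical):
import re

# Hypothetical phrase lists; each alternative gets its own named group.
combined = re.compile(
    r"(?P<greeting>hello|hi)|(?P<farewell>bye|goodbye)|(?P<number>\d+)"
)

for m in combined.finditer("hi there, 42 people said goodbye"):
    print(m.lastgroup, m.group(0))
# greeting hi
# number 42
# farewell goodbye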
If speed is of the essence, you are better off running some tests before you decide how to code your production application.
First of all, you said that you are searching for words, which suggests you may be able to do this using split() to break up the string on whitespace and then use simple string comparisons for your search.
Definitely do compile your regular expressions and do a timing test comparing that with the plain string functions. Check the documentation for the string class for a full list.
Your requirement appears to be searching a text for the first occurrence of any one of a collection of strings. Presumably you then wish to restart the search to find the next occurrence, and so on until the searched string is exhausted. Only plain old string comparison is involved.
The classic algorithm for this task is Aho-Corasick for which there is a Python extension (written in C). This should beat the socks off any alternative that's using the re module.
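A minimal sketch with the pyahocorasick package (assuming its usual Automaton API; the word list and text are hypothetical):
import ahocorasick  # pip install pyahocorasick

words = ["foo", "bar", "baz"]  # hypothetical search terms

automaton = ahocorasick.Automaton()
for index, word in enumerate(words):
    automaton.add_word(word, (index, word))
automaton.make_automaton()  # build the Aho-Corasick automaton once

text = "embargo bazaar food"
for end_pos, (index, word) in automaton.iter(text):
    start_pos = end_pos - len(word) + 1
    print(word, start_pos, end_pos)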
If you want to know how fast regex compilation is, you need to benchmark it.
Here is how I do that. It compiles each pattern 1 million times.
import time, re

def taken(f):
    def wrap(*arg):
        t1, r, t2 = time.time(), f(*arg), time.time()
        print(t2 - t1, "s taken")
        return r
    return wrap

@taken
def regex_compile_test(x):
    for i in range(1000000):
        re.compile(x)
    print("for", x, end=" ")

# sample tests
regex_compile_test("a")
regex_compile_test("[a-z]")
regex_compile_test(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}")
It took around 5 seconds for each pattern on my computer:
for a 4.88999986649 s taken
for [a-z] 4.70300006866 s taken
for [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} 4.78200006485 s taken
The real bottleneck is not in compiling the patterns; it is in extracting text with re.findall or replacing with re.sub. If you use those against several MB of text, it is quite slow.
If your text is fixed, use plain str.find; it is faster than regex.
Actually, if you give us samples of your text and of your regex patterns, we could give you a better idea; there are many great regex and Python folks out there.
Hope this helps, and sorry if my answer isn't useful to you.
When you compile the regexp, it is converted into a state machine representation. Provided the regexp is efficiently expressed, it should still be very fast to match. Compiling the regexp can be expensive though, so you will want to do that up front, and as infrequently as possible. Ultimately though, only you can answer if it is fast enough for your requirements.
There are other string searching approaches, such as the Boyer-Moore algorithm. But I'd wager the complexity of searching for multiple separate strings one by one is much higher than that of a single regexp that can branch on each successive character.
This is a question that can readily be answered by just trying it.
>>> import re
>>> import timeit
>>> find = ['foo', 'bar', 'baz']
>>> pattern = re.compile("|".join(find))
>>> with open('c:\\temp\\words.txt', 'r') as f:
...     words = f.readlines()
...
>>> len(words)
235882
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if w.find(s) >= 0), words)', 'from __main__ import find, words', number=30)
18.404569854548527
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if s in w), words)', 'from __main__ import find, words', number=30)
10.953313759150944
>>> timeit.timeit('r = filter(lambda w: pattern.search(w), words)', 'from __main__ import pattern, words', number=30)
6.8793022576891758
It looks like you can reasonably expect regular expressions to be faster than using find or in. Though if I were you I'd repeat this test with a case that was more like your real data.
If you're just searching for a particular substring, use str.find() instead.
Depending on what you're doing it might be better to use a tokenizer and loop through the tokens to find matches.
However, when it comes to short pieces of text, regexes have incredibly good performance. Personally, I only remember running into problems when text sizes became ridiculous, like 100k words or so.
Furthermore, if you are worried about the speed of actual regex compilation rather than matching, you might benefit from creating a daemon that compiles all the regexes then goes through all the pieces of text in a big loop or runs as a service. This way you will only have to compile the regexes once.
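A minimal sketch of the compile-once idea, assuming the patterns are known up front (the patterns and function name here are hypothetical):
import re

# Compiled once at import time; reused for every piece of text afterwards.
PATTERNS = [re.compile(p) for p in (r"foo\w*", r"bar\d+", r"hello|hi")]

def scan(text):
    """Return every substring matched by any of the precompiled patterns."""
    return [m.group(0) for p in PATTERNS for m in p.finditer(text)]

if __name__ == "__main__":
    # A long-running loop or service would call scan() repeatedly
    # without ever recompiling the patterns.
    print(scan("hi foobar, bar42 says hello"))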
In the general case, you can use the "in" keyword:
for line in open("file"):
    if "word" in line:
        print(line.rstrip())
regex is usually not needed when you use Python :)
