Python: find regexp in a file

Have:
f = open(...)
r = re.compile(...)
Need:
Find the position (start and end) of the first match of the regexp in a big file (starting from current_pos=...).
How can I do this?
I want to have this function:
def find_first_regex_in_file(f, regexp, start_pos=0):
    f.seek(start_pos)
    ...  # (searching f for regexp starting from start_pos) HOW?
    return [match_start, match_end]
File 'f' is expected to be big.

One way to search through big files is to use the mmap library to map the file into a big memory chunk. Then you can search through it without having to explicitly read it.
For example, something like:
import mmap
import os
import re

size = os.stat(fn).st_size
f = open(fn)
data = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
m = re.search(r"867-?5309", data)
This works well for very big files (I've done it for a file 30+ GB in size, but you'll need a 64-bit OS if your file is more than a GB or two).
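Putting that together with the function signature from the question, a minimal sketch might look like this (untested; the pattern is assumed to be precompiled, and on Python 3 you would need to open the file in binary mode and use a bytes pattern):

import mmap
import os

def find_first_regex_in_file(f, regexp, start_pos=0):
    # Sketch only: regexp is a compiled pattern; offsets are byte positions.
    size = os.fstat(f.fileno()).st_size
    data = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
    try:
        m = regexp.search(data, start_pos)  # pattern.search(buf, pos) starts at pos
        return [m.start(), m.end()] if m else None
    finally:
        data.close()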

The following code works reasonably well with test files around 2GB in size.
def search_file(pattern, filename, offset=0):
    with open(filename) as f:
        f.seek(offset)
        for line in f:
            m = pattern.search(line)
            if m:
                search_offset = f.tell() - len(line) - 1
                return search_offset + m.start(), search_offset + m.end()
Note that the regular expression must not span multiple lines.
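For example, a hypothetical call (the pattern and filename are made up) could look like:

import re

# Find the first 867-5309-style token at or after byte 1024; returns None if absent.
pattern = re.compile(r"867-?5309")
span = search_file(pattern, "big.log", offset=1024)
if span:
    print("match at bytes %d-%d" % span)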

Note: this has been tested on Python 2.7. You may have to tweak things in Python 3 to handle strings vs. bytes, but hopefully it won't be too painful.
Memory-mapped files may not be ideal for your situation (in 32-bit mode there is a greater chance that there isn't enough contiguous virtual memory, you can't read from pipes or other non-files, and so on).
Here is a solution that reads 128k blocks at a time; as long as your regex matches a string smaller than that size, it will work. Note also that you are not restricted to single-line regexes. This solution is plenty fast, although I suspect it will be marginally slower than using mmap; how much probably depends more on what you do with the matches and on the size/complexity of the regex you're searching for.
The method keeps at most two blocks in memory. In some use cases you might want to enforce at least one match per block as a sanity check, but this method truncates in order to keep at most two blocks in memory. It also makes sure that any regex match that runs right up to the end of the current block is NOT yielded; instead, that position is held back until either the input is truly exhausted or a later block produces a match that ends before the block boundary, which better handles patterns like "[^\n]+" or "xxx$". You may still be able to break things with a lookahead at the end of the regex, like xx(?!xyz) where yz is in the next block, but in most cases you can work around using such patterns.
import re

def regex_stream(regex, stream, block_size=128*1024):
    stream_read = stream.read
    finditer = regex.finditer
    block = stream_read(block_size)
    if not block:
        return
    lastpos = 0
    for mo in finditer(block):
        if mo.end() != len(block):
            yield mo
            lastpos = mo.end()
        else:
            break
    while True:
        new_buffer = stream_read(block_size)
        if not new_buffer:
            break
        if lastpos:
            size_to_append = len(block) - lastpos
            if size_to_append > block_size:
                block = '%s%s' % (block[-block_size:], new_buffer)
            else:
                block = '%s%s' % (block[lastpos:], new_buffer)
        else:
            size_to_append = len(block)
            if size_to_append > block_size:
                block = '%s%s' % (block[-block_size:], new_buffer)
            else:
                block = '%s%s' % (block, new_buffer)
        lastpos = 0
        for mo in finditer(block):
            if mo.end() != len(block):
                yield mo
                lastpos = mo.end()
            else:
                break
    if lastpos:
        block = block[lastpos:]
    for mo in finditer(block):
        yield mo
To test / explore, you can run this:
# NOTE: you can substitute a real file stream here for t_in, but this is just a test
import cStringIO

t_in = cStringIO.StringIO('testing this is a 1regexxx\nanother 2regexx\nmore 3regexes')
block_size = len('testing this is a regex')
re_pattern = re.compile(r'\dregex+', re.DOTALL)
for match_obj in regex_stream(re_pattern, t_in, block_size=block_size):
    print 'found regex in block of len %s/%s: "%s[[[%s]]]%s"' % (
        len(match_obj.string),
        block_size, match_obj.string[:match_obj.start()].encode('string_escape'),
        match_obj.group(),
        match_obj.string[match_obj.end():].encode('string_escape'))
Here is the output:
found regex in block of len 46/23: "testing this is a [[[1regexxx]]]\nanother 2regexx\nmor"
found regex in block of len 46/23: "testing this is a 1regexxx\nanother [[[2regexx]]]\nmor"
found regex in block of len 14/23: "\nmore [[[3regex]]]es"
This can be useful in conjunction with quick-parsing a large XML file, where it can be split up into mini-DOMs based on a sub-element as root, instead of having to dive into callbacks and state handling when using a SAX parser. It also allows you to filter through XML faster. I've used it for tons of other purposes as well; I'm kind of surprised recipes like this aren't more readily available on the net!
One more thing: parsing unicode should work as long as the passed-in stream produces unicode strings, and if you're using character classes like \w, you'll need to add the re.U flag to the re.compile() call. In this case block_size actually means character count instead of byte count.
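For instance, a sketch of feeding the generator a unicode stream (the filename and pattern are made up; io.open decodes to unicode on both Python 2 and 3):

import io
import re

pattern = re.compile(r'\w+\d+', re.U)
with io.open('data.txt', encoding='utf-8') as stream:
    for mo in regex_stream(pattern, stream, block_size=64 * 1024):
        print(mo.group())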

Related

How can I efficiently search for many strings at once in many files?

Hi, I'll post my portion of code and then explain my goal:
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        lines = f.readlines()
        for entry in rs:
            for line in lines:
                if entry in line:
                    print("found %s in %s" % (entry, eachcsv))
So in "matches" i got a list of csv files (the path to them). I open every csv file and i load them into memory with readlines(). "rs" is a list of unique ids. For every element of the list "rs" i need to search every line of the csv file and print each time i find the id on the file (later on i will test if the line contains another fixed word also).
The code above works for my purpose but i don't know why it takes more than 10 minutes to process a 400k row file, i need to do this task for thousand of files so it's impossible for me to finish the task. It seems to me that the slow part is the testing process, but i'm not sure.
Please note that i use python because i'm more confortable with it, if there is ANY other solution to my problem using other tools i'm ok with it.
EDIT:
I will try to post some examples.
"rs" list:
rs12334435
rs3244567
rs897686
....
files:
# header data not needed
# data
# data
# data
# data
# data [...]
#COLUMN1 COLUMN2 COLUMN3 ...
data rs7854.rs165463 dataSS=1(random_data)
data rs465465data datadata
data rs798436 dataSS=1
data datars45648 dataSS=1
The final goal is to count how many times every rs appears in each file and, if column 3 contains SS=1, to flag it in the output.
Something like
found rs12345 SS yes file 3 folder /root/foobar/file
found rs74565 SS no file 3 folder /root/foobar/file
Much of the problem is because you have so many nested loops. You can probably make your program faster by eliminating loops:
One of the loops is over each of the lines in the file. But if all you want to do is determine whether any of the matches exists in the file, you can search the whole file body in one operation. To be sure, this searches a longer string, but it does so in one operation in native code instead of doing it in Python.
The other loop is over all the match strings. But you know those before you start, and they are the same for each file. So this is a good case where doing more up-front work pays off in time saved in the rest of the program. Stand back, I'm going to use a regular expression.
Here is a version of the code which combines these two ideas:
import re
import random
import sys
import time

# get_patterns makes up some test data.
def get_patterns():
    rng = random.Random(1)  # fixed seed, for reproducibility
    n = 300
    # Generate up to n unique integers between 60k and 80k.
    return list(set([str(rng.randint(60000, 80000)) for _ in xrange(n)]))

def original(rs, matches):
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            lines = f.readlines()
            for entry in rs:
                for line in lines:
                    if entry in line:
                        print("found %s in %s" % (entry, eachcsv))

def mine(rs, matches):
    my_rx = re.compile(build_regex(rs))
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            body = f.read()
            matches = my_rx.findall(body)
            for match in matches:
                print "found %s in %s" % (match, eachcsv)

def build_regex(literal_patterns):
    return "|".join([re.escape(pat) for pat in literal_patterns])

def print_elapsed_time(label, callable, args):
    t1 = time.time()
    callable(*args)
    t2 = time.time()
    elapsed_ms = (t2 - t1) * 1000
    print "%8s: %9.1f milliseconds" % (label, elapsed_ms)

def main(args):
    rs = get_patterns()
    filenames = args[1:]
    for function_name_and_function in (('original', original), ('mine', mine)):
        name, func = function_name_and_function
        print_elapsed_time(name, func, [rs, filenames])
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))
Your original code is in there as original and my replacement is mine.
For 300 patterns, my implementation runs in 400ms on my computer, which is roughly a 30x speedup. For more match strings, the improvement should be greater. Doubling the number of patterns roughly doubles the runtime of your implementation, but the regex-based one only takes about 3% longer (though this is partly because my test patterns all have similar prefixes, and that may not be true for the real data).
Edit: updated the code to print one message for each match. My code is now somewhat slower, but it's still an improvement, and should be relatively faster the more strings there are to match:
~/source/stackoverflow/36923237$ python search.py example.csv
found green fox in example.csv
original: 9218.0 milliseconds
found green fox in example.csv
mine: 600.4 milliseconds
Edit: explanation of the regex technique, as requested.
Suppose you want to search your file for the strings foobar and umspquux. One way to do this is to search the file first for foobar and then for umspquux. This is the approach you started with.
Another approach is to search for both strings at once. Imagine that you check the first character of the file. If it's 'f' or 'u', you might be looking at a match, and should check the second character to see whether it is, respectively, 'o' or 'm'. And so on. If you get to the end of the file, you will have found all the matches there are in the file to find.
A convenient way to tell a computer to look for multiple strings at once is to use a regular expression. Normal strings are regular expressions. The regular expression 'foobar' matches the sub-string 'foobar' inside 'the foobar is all fuzzy'. However, you can do more complex things. You can combine two regular expressions, each of which matches something, into a combined regular expression which will match either of those somethings. This is done with the alternation symbol, '|'. So the regular expression 'foobar|umspquux' will match either 'foobar' or 'umspquux'. You can also match a real '|' by escaping the significance of the '|' with a backslash '\'.
This is what build_regex is all about. It converts the list ['foobar', 'umspquux'] into the string 'foobar|umspquux'. Try it out - put the function definition into your own file and call it with some try-out arguments to see how it behaves.
That's a good way, by the way, to figure out how any piece of code works - run part of it and print the intermediate result. This is harder to do safely with programs that have side-effects of course, but this program doesn't.
The call to re.escape in build_regex simply ensures that any special regular expression operators (such as '|') are escaped (giving in this case '\|') so that they match themselves.
The last part of the regular expression method here is to use the findall method of the compiled regular expression. This simply returns all the matches for our regular expression in the input string (i.e. the body of the file).
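For example, a quick check of build_regex and findall together (the sample strings are made up; the expected output is shown in the comment):

import re

patterns = ['foobar', 'umspquux', 'a|b']    # the last one contains a literal '|'
combined = build_regex(patterns)            # 'foobar|umspquux|a\\|b'
rx = re.compile(combined)
print(rx.findall('the foobar is all fuzzy, umspquux, and a|b too'))
# ['foobar', 'umspquux', 'a|b']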
You can read up on Python regular expressions in the Python documentation on regular expressions. That documentation is basically reference material, so you might find the gentler introduction to Python regular expressions on the Google Developers site a better starting point. Jeffrey Friedl's Mastering Regular Expressions is a pretty comprehensive work on regular expressions, though it doesn't happen to cover the Python dialect.
Your largest memory-intensive operation is reading every line; this should be in the outer loop.
You also don't want to read the entire document in at once with readlines(); iterate over the file line by line instead, which is much less memory-intensive.
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    print("found %s in %s" % (entry, eachcsv))
More reading here https://docs.python.org/2/tutorial/inputoutput.html
There are other optimizations you can apply too that fall outside the scope of this question, such as using threading or multiprocessing to read many csv files at the same time. https://docs.python.org/2/library/threading.html
import threading

def findRSinCSVFile(csvFile, rs):
    with open(csvFile, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    print("found %s in %s" % (entry, csvFile))

threads = []
for csvFile in csvFiles:
    threads.append(threading.Thread(target=findRSinCSVFile, args=(csvFile, rs)))
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
This will allow you to parse all the csv files at the same time.
(Note: I have not tested any of this code; it only serves as an example.)
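A rough sketch of the multiprocessing variant mentioned above (also untested; rs and csvFiles are assumed to be the id list and the list of file paths from the question):

import multiprocessing

def findRSinOneFile(args):
    csvFile, rs = args
    hits = []
    with open(csvFile, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    hits.append("found %s in %s" % (entry, csvFile))
    return hits

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    for hits in pool.map(findRSinOneFile, [(csvFile, rs) for csvFile in csvFiles]):
        for hit in hits:
            print(hit)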

Regular Expression String Mangling Efficiency in Python - Explanation for Slowness?

I'm hoping someone can help explain why Python's re module seems to be so slow at chopping up a very large string for me.
I have a string ("content") that is very nearly 600k bytes in size. I'm trying to hack off just the beginning part of it, a variable number of lines, delimited by the text ">>>FOOBAR<<<".
The literal completion time is provided for comparison purposes - the script that this snippet is in takes a bit to run naturally.
The first and worst method:
import re
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = re.sub(".*>>>FOOBAR<<<", ">>>FOOBAR<<<", content, flags=re.S)
Has a completion time of:
real 6m7.213s
While a wordy method:
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
newstir = ""
flag = False
for l in content.split('\n'):
    if re.search(">>>FOOBAR<<<", l):
        flag = True
    #End if we encountered our flag line
    if flag:
        newstir += l
#End loop through content
content = newstir
Has an expected completion time of:
real 1m5.898s
And using a string's .split method:
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = content.split(">>>FOOBAR<<<")[1]
Also has an expected completion time of:
real 1m6.427s
What's going on here? Why is my re.sub call so ungodly slow for the same string?
There is no good way to do it with a pattern starting with either .* or .*?, in particular with large data, since the first causes a lot of backtracking and the second must test, for each character it consumes, whether the following subpattern fails (until it succeeds). Using a non-greedy quantifier isn't faster than using a greedy one.
I suspect that your ~600k of content comes from a file in the first place. Instead of loading the whole file and storing its content in a variable, work line by line. This way you preserve memory and avoid splitting and creating a list of lines. Second, if you are looking for a literal string, don't use a regex method; use a simpler string method like find, which is faster:
result = ''
with open('yourfile') as fh:
    for line in fh:
        result += line
        if line.find('>>>FOOBAR<<<') > -1:
            break
If >>>FOOBAR<<< isn't a simple literal string but a regex pattern, in this case compile the pattern before:
result = ''
pat = re.compile(r'>>>[A-Z]+<<<')
with open('yourfile') as fh:
    for line in fh:
        result += line
        if pat.search(line):
            break

Pushing data into regex?

I am working on a small project which I call pydlp. It is basically a set of regex signatures that will extract data from a file object, plus a function that checks whether the extracted data is in fact interesting.
This code is how I perform matching. It is far from optimal, as I have to read the file over and over again.
for signature in signatures:
    match = signature.validate(signature.regex.match(fobj.read()))
    if match: matches.append(match)
    fobj.seek(0)
Is there a way to perform multiple regex matches on the same file object while only reading the file object content once? The file object can be large, so I cannot put it in memory.
Edit:
I want to clarify what I mean by "pushing data into regex". I recognize that a regex has similarities with a finite state machine. Instead of passing all the data at once to the regex engine, is it possible to push parts of it at a time?
while True:
    data = fobj.read(1024)
    if data == "": break
    for signature in signatures:
        match = signature.regex.push_and_match(data)
        if match: matches.append(match)
Edit 2:
Removed link, as I removed the project from github.
The standard way to do this sort of text processing with files too large to read into memory is to iterate over the file line by line:
regexes = [ .... ]

with open('large.file.txt') as fh:
    for line in fh:
        for rgx in regexes:
            m = rgx.search(line)
            if m:
                pass  # Do stuff.
But that approach assumes your regexes can operate successfully on single lines of text in isolation. If they cannot, perhaps there are other units that you can pass to the regexes (eg, paragraphs delimited by blank lines). In other words, you might need to do a bit of pre-parsing so that you can grab meaningful sections of text before sending them to your main regexes.
with open('large.file.txt') as fh:
    section = []
    for line in fh:
        if line.strip():
            section.append(line)
        else:
            # We've hit the end of a section, so we
            # should check it against our regexes.
            process_section(''.join(section), regexes)
            section = []
    # Don't forget the last one.
    if section:
        process_section(''.join(section), regexes)
Regarding your literal question: "Is there a way to perform multiple regex matches on the same file object while only reading the file object content once". No and yes. No in the sense that Python regexes operate on strings, not file objects. But you can perform multiple regex searches at the same time on one string, simply by using alternation. Here's a minimal example:
import re

patterns = 'aa bb cc'.split()
big_regex = re.compile('|'.join(patterns))  # Match this or that or that.
m = big_regex.search(some_text)
But that doesn't really solve your problem if the file is too big for memory.
Maybe consider using re.findall() if you don't need match objects but only the matched strings? If the file is too big you can slice it into parts, as you suggest, but with some overlap so that no matches are missed (if you know the nature of the regexes, it may be possible to work out how big the overlap should be).

Regex re.findall() hangs - What if you cant read line by line

I have multiple files, each of which I am searching for a sequence of words.
My regex basically searches for a sequence where word1 is followed by word2, followed by word3, etc.
So the expression looks like:
strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
For files below 20kb, the expression executes pretty quickly. However, the execution time increases exponentially for files > 20kb, and the process completely hangs for files close to 100kb.
It appears (after having read previous threads) that the problem is to do with using .* in conjunction with re.DOTALL - leading to "catastrophic backtracking". The recommended solution was to provide the input file line by line instead of reading the whole file into a single memory buffer.
However, my input file is filled with random whitespace and "\n" newline characters. My word sequence is also long and occurs over multiple lines. Therefore, I need to input the whole file together into the regex expression in conjunction with re.DOTALL - otherwise a line by line search will never find my sequence.
Is there any way around it?
If you're literally searching for the occurrence of three words, with no regex patterns in them at all, there's no need to use regexes at all – as suggested by @Bart as I wrote this answer :). Something like this might work (untested, and can probably be prettier):
with open('...') as f:
    contents = f.read()

words = ['word1', 'word2', 'word3']
matches = []
start_idx = 0
try:
    while True:
        cand = []
        for word in words:
            word_idx = contents.index(word, start_idx)
            cand.append(word_idx)
            start_idx = word_idx + len(word)
        matches.append(cand)
except ValueError:  # from index() failing
    pass
This puts the indices in matches; if you want an equivalent result to findall, you could do, say,
found = [contents[match[0]:match[-1]+len(words[-1])] for match in matches]
You could also make this kind of approach work without reading the whole file in beforehand by replacing the call to index with an equivalent function on files. I don't think the stdlib includes such a function; you'd probably have to manually use readline() and tell() or similar methods on file objects.
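A rough sketch of such a helper, reading in chunks and keeping a small overlap so a word split across chunk boundaries is still found (the name and defaults are made up):

def find_in_file(f, word, start_pos=0, chunk_size=1 << 16):
    """Return the absolute offset of the first occurrence of word at or
    after start_pos, or -1 if it is not found. Sketch only."""
    f.seek(start_pos)
    buf = ''
    buf_start = start_pos                  # absolute file offset of buf[0]
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return -1
        buf += chunk
        idx = buf.find(word)
        if idx >= 0:
            return buf_start + idx
        # Keep only a tail long enough to catch a word split across chunks.
        keep = len(word) - 1
        if len(buf) > keep:
            buf_start += len(buf) - keep
            buf = buf[-keep:] if keep > 0 else ''

Replacing contents.index with something like this lets the same loop work directly on a file object, at the cost of only supporting plain substrings.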
The reason this happens is because python's regex engine uses backtracking. At every .*, if the following word is not found, the engine must go all the way to the end of the string (100kb) and then backtrack. Now consider what happens if there are many "almost matches" after the last match. The engine keeps jumping back and forth from the start of the match to the end of the string.
You can fix it by using a regex engine based on an NFA rather than backtracking. Note that this limits the kinds of regexes you can use (no backtracking or arbitrary zero-width assertions), but it's fine for your use case.
You can find such an engine here. You can visualize how an NFA engine works at www.debuggex.com.
You can use a loop to search for one word at a time. I'm using str.find() here as it is faster for simple substring search, but you can also adapt this code to work with re.search() instead.
def findstrings(text, words):
    end = 0
    while True:
        start = None
        for word in words:
            pos = text.find(word, end)  # starts from position end
            if pos < 0:
                return
            if start is None:
                start = pos
            end = pos + len(word)
        yield text[start:end]

# usage in place of re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
list(findstrings(f.read(), ['word1', 'word2', 'word3']))

findall/finditer on a stream?

Is there a way to get the re.findall, or better yet, re.finditer functionality applied to a stream (i.e. a file handle open for reading)?
Note that I am not assuming that the pattern to be matched is fully contained within one line of input (i.e. multi-line patterns are permitted). Nor am I assuming a maximum match length.
It is true that, at this level of generality, it is possible to specify a regex that would require that the regex engine have access to the entire string (e.g. r'(?sm).*'), and, of course, this means having to read the entire file into memory, but I am not concerned with this worst-case scenario at the moment. It is, after all, perfectly possible to write multi-line-matching regular expressions that would not require reading the entire file into memory.
Is it possible to access the underlying automaton (or whatever is used internally) from a compiled regex, to feed it a stream of characters?
Thanks!
Edit: Added clarifications regarding multi-line patterns and match lengths, in response to Tim Pietzcker's and rplnt's answers.
This is possible if you know that a regex match will never span a newline.
Then you can simply do
for line in file:
    result = re.finditer(regex, line)
    # do something...
If matches can extend over multiple lines, you need to read the entire file into memory. Otherwise, how would you know if your match was done already, or if some content further ahead would make a match impossible, or if a match is only unsuccessful because the file hasn't been read far enough?
Edit:
Theoretically it is possible to do this. The regex engine would have to check whether at any point during the match attempt it reaches the end of the currently read portion of the stream, and if it does, read on ahead (possibly until EOF). But the Python engine doesn't do this.
Edit 2:
I've taken a look at the Python stdlib's re.py and its related modules. The actual generation of a regex object, including its .match() method and others is done in a C extension. So you can't access and monkeypatch it to also handle streams, unless you edit the C sources directly and build your own Python version.
It would be possible to implement this for regexps with a known maximum match length: either no +/* quantifiers, or ones where you know the maximum number of repetitions. If you know this, you can read the file in chunks and match on those, yielding the results. You would also run the regexp on an overlapping chunk, to cover the case where a match would have succeeded but was cut off by the end of a chunk.
some pseudo(python)code:
overlap_tail = ''
matched = {}
for chunk in file.stream(chunk_size):
    # calculate chunk_start
    for result in finditer(match, overlap_tail + chunk):
        if chunk_start + result.start() not in matched:
            yield result
            matched[chunk_start + result.start()] = result
    # delete old results from dict
    overlap_tail = chunk[-max_re_len:]
Just an idea, but I hope you get what I'm trying to achieve. You'd need to consider that the file (stream) could end, and some other cases, but I think it can be done (if the maximum length of a match is limited and known).
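A possible fleshed-out version of that idea, assuming no single match is longer than max_match_len (the names are made up, and a match truncated at a chunk boundary is only deduplicated by its absolute start offset, so the caveats above still apply):

def finditer_stream(pattern, fobj, chunk_size=64 * 1024, max_match_len=1024):
    # Sketch only: yields match objects whose offsets are relative to the
    # buffer they were found in, not to the whole stream.
    overlap_tail = ''
    seen = set()            # absolute start offsets already yielded
    chunk_start = 0         # absolute offset of the current chunk's first character
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        buf = overlap_tail + chunk
        buf_start = chunk_start - len(overlap_tail)
        for result in pattern.finditer(buf):
            abs_start = buf_start + result.start()
            if abs_start not in seen:
                seen.add(abs_start)
                yield result
        overlap_tail = chunk[-max_match_len:]
        chunk_start += len(chunk)
        # Forget offsets that cannot reappear in the next buffer.
        seen = {s for s in seen if s >= chunk_start - len(overlap_tail)}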
