Regex re.findall() hangs - What if you can't read line by line - python

I have multiple files, each of which I am searching for a sequence of words.
My regex basically searches for a sequence where word1 is followed by word2, followed by word3, and so on.
So the expression looks like:
strings = re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
For files below 20 KB, the expression executes quickly. However, execution time increases dramatically for files over 20 KB, and the process hangs completely for files close to 100 KB.
It appears (after having read previous threads) that the problem is to do with using .* in conjunction with re.DOTALL - leading to "catastrophic backtracking". The recommended solution was to provide the input file line by line instead of reading the whole file into a single memory buffer.
However, my input file is filled with random whitespace and "\n" newline characters. My word sequence is also long and occurs over multiple lines. Therefore, I need to input the whole file together into the regex expression in conjunction with re.DOTALL - otherwise a line by line search will never find my sequence.
Is there any way around it?

If you're literally searching for the occurrence of three words, with no regex patterns in them at all, there's no need to use regexes - as suggested by @Bart as I wrote this answer :). Something like this might work (untested, and can probably be prettier):
with open('...') as f:
    contents = f.read()

words = ['word1', 'word2', 'word3']
matches = []
start_idx = 0
try:
    while True:
        cand = []
        for word in words:
            word_idx = contents.index(word, start_idx)
            cand.append(word_idx)
            start_idx = word_idx + len(word)
        matches.append(cand)
except ValueError:  # from index() failing
    pass
This puts the indices in matches; if you want an equivalent result to findall, you could do, say,
found = [contents[match[0]:match[-1] + len(words[-1])] for match in matches]
You could also make this kind of approach work without reading the whole file in beforehand by replacing the call to index with an equivalent function on files. I don't think the stdlib includes such a function; you'd probably have to manually use readline() and tell() or similar methods on file objects.
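A rough, untested sketch of what such a helper might look like (file_index is a hypothetical name, not a stdlib function), assuming the file is opened in binary mode and each individual search word fits on a single line:

def file_index(f, word, start_pos=0):
    """Return the offset of the first occurrence of word at or after start_pos."""
    f.seek(start_pos)
    while True:
        line_start = f.tell()  # byte offset of the next line (binary mode)
        line = f.readline()
        if not line:
            raise ValueError("substring not found")  # mimic str.index()
        pos = line.find(word)
        if pos >= 0:
            return line_start + pos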

The reason this happens is that Python's regex engine uses backtracking. At every .*, if the following word is not found, the engine must scan all the way to the end of the string (100 KB) and then backtrack. Now consider what happens when there are many "almost matches" after the last real match: the engine keeps jumping back and forth between the start of the match and the end of the string.
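To see this effect directly, here is a hedged illustration (not from the original post): a string with one word1, many word2s, and no word3 at all. Each failed attempt forces the engine to rescan the rest of the string, so the time grows roughly quadratically with the number of repeats.

import re
import time

for repeats in (500, 1000, 2000, 4000):
    text = "word1 " + "word2 " * repeats  # word3 never appears
    t0 = time.time()
    re.findall('word1.*?word2.*?word3', text, re.DOTALL)
    print(repeats, round(time.time() - t0, 3))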
You can fix it by using a regex engine based on an NFA rather than backtracking. Note that this limits the kinds of regexes you can use (no backtracking or arbitrary zero-width assertions), but it's fine for your use case.
You can find such an engine here. You can visualize how an NFA engine works at www.debuggex.com.

You can use a loop to search for one word at a time. I'm using str.find() here as it is faster for simple substring search, but you can also adapt this code to work with re.search() instead.
def findstrings(text, words):
    end = 0
    while True:
        start = None
        for word in words:
            pos = text.find(word, end)  # starts searching from position end
            if pos < 0:
                return
            if start is None:
                start = pos
            end = pos + len(word)
        yield text[start:end]

# usage in place of re.findall('word1.*?word2.*?word3', f.read(), re.DOTALL)
list(findstrings(f.read(), ['word1', 'word2', 'word3']))
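For the re.search() variant mentioned above, a hedged adaptation could look like this (findpatterns is an illustrative name; each step is a compiled regex instead of a literal word):

import re

def findpatterns(text, patterns):
    end = 0
    while True:
        start = None
        for pat in patterns:
            m = pat.search(text, end)  # start searching from position end
            if not m:
                return
            if start is None:
                start = m.start()
            end = m.end()
        yield text[start:end]

# usage, mirroring the call above
list(findpatterns(f.read(), [re.compile(w) for w in ('word1', 'word2', 'word3')]))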

Related

How do I build a tokenizing regex based iterator in python

I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex based iterator using more_itertools's pairwise iterator recipe.
Following is my code taken from that answer:
from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter, including segments beginning at the start and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here
I then noticed that if the string starts or ends with delimiters (e.g. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d ") the tokenizer will print empty strings at the beginning and end of its token output (these are actually extra matches against the string start and string end). To remedy this I tried the following (quite ugly) attempts at other regexes:
"(?:^|[ ]|$)+" - this seems quite simple and like it should work, but it doesn't (and it also seems to behave wildly differently on other regex engines). For some reason it won't build a single match out of the string's start and the delimiters following it; the string start somehow also consumes the character after it! (This is also where I see divergence from other engines - is this a BUG? Or does it have something to do with zero-width assertions and the alternation (|) operator in Python that I'm not aware of?) This attempt also did nothing for the double match containing the string's end: once it matched the delimiters, it then gave another match for the string-end ($) position by itself.
"(?:[ ]|$|^)+" - putting the delimiters first actually solves one of the problems: the split at the beginning no longer contains the string start (which I don't care much about anyway, since I'm interested in the tokens themselves), and it also matches the string start when there are no delimiters at the beginning of the string. The string ending is still a problem, though.
"(^[ ]*)|([ ]*$)|([ ]+)" - this final attempt got the string start to be part of the first match (which wasn't really much of a problem in the first place), but try as I might I couldn't get rid of the "delimiters + end, then end again" problem (which yields an additional empty string). Still, I'm showing this example (with grouping) since it shows that the ending special character $ is matched twice: once with the preceding delimiters and once by itself (two matches of group 2).
My questions are:
Why do I get such strange behavior in attempt #1?
How do I solve the end-of-string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
Remember that the solution can't change the string and must produce an iterable generator which iterates over the spaces between the tokens and not the tokens themselves. (This last part might seem to complicate the answer unnecessarily - otherwise I'd have a simple answer - but if you must know, it's part of a bigger framework I'm building in which this yielding method is inherited by a pipeline that constructs yielded sentences from it in various patterns, which are then used to extract fields from semi-structured, classifier-driven messages.)
The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:
delimiter_re = r'[\n\- ]'  # newline, hyphen, or space
search_regex = r'''^(?!{0})    # string start with no delimiter
                   |           # or
                   {0}+        # sequence of delimiters (at least one)
                   |           # or
                   (?<!{0})$   # string end with no delimiter
                '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)
Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
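To connect this back to the pairwise-based splitter from the question, a hedged usage sketch (reusing search_pattern from above):

from more_itertools import pairwise

string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
# no empty tokens at the ends: ^ and $ only match when not adjacent to a delimiter
for prev, curr in pairwise(search_pattern.finditer(string)):
    print(string[prev.end():curr.start()])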
It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:
token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.)
The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.
It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting chars. You could instead use regex groups and a negated character class ([^...]) to achieve this:
import re

string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "
delimiters = r'\n\- '
# match any number of consecutive non-delimiter chars
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))
output:
dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d

Regular Expression String Mangling Efficiency in Python - Explanation for Slowness?

I'm hoping someone can help explain why Python's re module seems to be so slow at chopping up a very large string for me.
I have a string ("content") that is very nearly 600k bytes in size. I'm trying to hack off just the beginning part of it, a variable number of lines, delimited by the text ">>>FOOBAR<<<".
The literal completion times below are provided for comparison purposes; the script this snippet lives in naturally takes a while to run.
The first and worst method:
import re
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = re.sub(".*>>>FOOBAR<<<", ">>>FOOBAR<<<", content, flags=re.S)
Has a completion time of:
real 6m7.213s
While a wordy method:
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
newstir = ""
flag = False
for l in content.split('\n'):
if re.search(">>>FOOBAR<<<", l):
flag = True
#End if we encountered our flag line
if flag:
newstir += l
#End loop through content
content = newstir
Has an expected completion time of:
real 1m5.898s
And using a string's .split method:
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = content.split(">>>FOOBAR<<<")[1]
Also has an expected completion time of:
real 1m6.427s
What's going on here? Why is my re.sub call so ungodly slow for the same string?
There is no good way to do this with a pattern that starts with .* or .*?, particularly on large data, since the first causes a lot of backtracking and the second must test, for each character it consumes, whether the following subpattern fails (until it succeeds). Using a non-greedy quantifier isn't faster than using a greedy one.
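If the whole string is already in memory anyway, a plain str.find() plus a slice does the same job with no regex at all (a minimal sketch, not from the original answer; it keeps everything from the first marker onwards, including the marker itself):

idx = content.find(">>>FOOBAR<<<")
if idx != -1:
    content = content[idx:]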
I suspect that your ~600k of content data comes from a file in the first place. Instead of loading the whole file and storing its content in a variable, work line by line. That way you preserve memory and avoid splitting the content into a list of lines. Second, if you are looking for a literal string, don't use a regex method; use a simple string method like find, which is faster:
result = ''
with open('yourfile') as fh:
    for line in fh:
        result += line
        if line.find('>>>FOOBAR<<<') > -1:
            break
If >>>FOOBAR<<< isn't a simple literal string but a regex pattern, compile the pattern beforehand:
pat = re.compile(r'>>>[A-Z]+<<<')
result = ''
with open('yourfile') as fh:
    for line in fh:
        result += line
        if pat.search(line):
            break

efficient way to get words before and after substring in text (python)

I'm using regex to find occurrences of string patterns in a body of text. Once I find that the string pattern occurs, I want to get x words before and after the string as well (x could be as small as 4, but preferably ~10 if still as efficient).
I am currently using regex to find all instances, but occasionally it will hang. Is there a more efficient way to solve this problem?
This is the solution I currently have:
sub = r'(\w*)\W*(\w*)\W*(\w*)\W*(\w*)\W*(%s)\W*(\w*)\W*(\w*)\W*(\w*)\W*(\w*)' % result_string  # re-find string and get surrounding +/- 4 words
surrounding_text = re.findall(sub, text)
for found_text in surrounding_text:
    result_found.append(" ".join(map(str, found_text)))
I'm not sure if this is what you're looking for:
>>> text = "Hello, world. Regular expressions are not always the answer."
>>> words = text.partition("Regular expressions")
>>> words
('Hello, world. ', 'Regular expressions', ' are not always the answer.')
>>> words_before = words[0]
>>> words_before
'Hello, world. '
>>> separator = words[1]
>>> separator
'Regular expressions'
>>> words_after = words[2]
>>> words_after
' are not always the answer.'
Basically, str.partition() splits the string into a 3-element tuple. In this example, the first element is all of the words before the specific "separator", the second element is the separator, and the third element is all of the words after the separator.
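Since the question asks for x words on each side, here is a hedged extension of the partition() idea (words_around is an illustrative name, not a library function):

def words_around(text, sep, x=4):
    """Return up to x whitespace-separated words before and after the first sep."""
    before, found, after = text.partition(sep)
    if not found:
        return None
    return before.split()[-x:], after.split()[:x]

print(words_around("Hello, world. Regular expressions are not always the answer.",
                   "Regular expressions", x=3))
# -> (['Hello,', 'world.'], ['are', 'not', 'always'])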
The main problem with your pattern is that it begins with optional parts, which causes many attempts at each position in the string until a match is found. The number of attempts grows with the text size and with the value of n (the number of words before and after). This is why only a few lines of text suffice to crash your code.
One approach is to begin the pattern with the target word and to use lookarounds to capture the text (or the words) before and after:
keyword (?= words after ) (?<= words before - keyword)
Starting a pattern with the searched word (a literal string) makes it very fast, and words around are then quickly found from this position in the string. Unfortunately the re module has some limitations and doesn't allow variable length lookbehinds (as many other regex flavors).
The new regex module supports variable length lookbehinds and other useful features like the ability to store the matches of a repeated capture group (handy to get the separated words in one shot).
import regex

text = '''In strange contrast to the hardly tolerable constraint and nameless
invisible domineerings of the captain's table, was the entire care-free
license and ease, the almost frantic democracy of those inferior fellows
the harpooneers. While their masters, the mates, seemed afraid of the
sound of the hinges of their own jaws, the harpooneers chewed their food
with such a relish that there was a report to it.'''

word = 'harpooneers'
n = 4

pattern = r'''
\m (?<target> %s ) \M     # target word
(?<=                      # content before
    (?<before> (?: (?<wdb>\w+) \W+ ){0,%d} )
    %s
)
(?=                       # content after
    (?<after> (?: \W+ (?<wda>\w+) ){0,%d} )
)
''' % (word, n, word, n)

rgx = regex.compile(pattern, regex.VERBOSE | regex.IGNORECASE)

class Result(object):
    def __init__(self, m):
        self.target_span = m.span()
        self.excerpt_span = (m.starts('before')[0], m.ends('after')[0])
        self.excerpt = m.expandf('{before}{target}{after}')
        self.words_before = m.captures('wdb')[::-1]
        self.words_after = m.captures('wda')

results = [Result(m) for m in rgx.finditer(text)]

print(results[0].excerpt)
print(results[0].excerpt_span)
print(results[0].words_before)
print(results[0].words_after)
print(results[1].excerpt)
Making a regex (well, anything, for that matter) with "as many repetitions as you will ever possibly need" is an extremely bad idea. That's because you
do an excessive amount of needless work every time
cannot really know for sure how much you will ever possibly need, thus introducing an arbitrary limitation
The bottom line for the solutions below: the 1st is the most effective for large data; the 2nd is the closest to your current approach, but scales much worse.
strip your entities to exactly what you are interested in at each moment:
find the substring (e.g. str.index. For whole words only, re.search with e.g. r'\b%s\b' % re.escape(word) is more suitable)
go N words back.
Since you mentioned a "text", your strings are likely to be very large, so you want to avoid copying potentially unlimited chunks of them.
E.g. re.finditer over a substring-reverse-iterator-in-place (see "slices to immutable strings by reference and not copy" and "Best way to loop over a python string backwards"). This would only beat slicing when the latter is expensive in terms of CPU and/or memory - test on some realistic examples to find out. It doesn't work, though: re works directly with the memory buffer, so it's impossible to reverse a string for it without copying the data.
There's no function in Python to find a character from a class, nor an "xsplit". So the fastest way appears to be (i for i, c in enumerate(reversed(buffer(text, 0, substring_index))) if c.isspace()) (timeit gives ~100 ms on a P3 933 MHz for a full pass through a 100k string).
Alternatively:
Fix your regex to not be subject to catastrophic backtracking and eliminate code duplication (DRY principle).
The 2nd measure will eliminate the 2nd issue: we'll make the number of repetitions explicit (Python Zen, koan 2) and thus highly visible and manageable.
As for the 1st issue, if you really only need "up to known, same N" items in each case, you won't actually be doing "excessive work" by finding them together with your string.
The "fix" part here is \w*\W* -> \w+\W+. This eliminates major ambiguity (see the above link) from the fact that each x* can be a blank match.
Matching up to N words before the string effectively is harder:
with (\w+\W+){,10} or equivalent, the matcher will find all 10 words before discovering that your string doesn't follow them, then try 9, 8, and so on. To ease the load on the matcher somewhat, a \b before the pattern will make it only perform all this work at the beginning of each word
lookbehind is not allowed here: as the linked article explains, the regex engine must know how many characters to step back before trying the contained regex. And even if it were allowed, a lookbehind is tried before every character - i.e. it's even more of a CPU hog
As you can see, regexes aren't quite cut out for matching things backwards
To eliminate code duplication, either
use the aforementioned {,10}. This will not save individual words but should be noticeably faster for large text (see the above on how the matching works here). We can always parse the retrieved chunk of text in more details (with the regex in the next item) once we have it. Or
autogenerate the repetitive part
note that (\w+\W+)? repeated mindlessly is subject to the same ambiguity as above. To be unambiguous, the expression must be like this (w = (\w+\W+) here for brevity): (w(w...(ww?)?...)?)?, and all the groups need to be non-capturing. A sketch of generating such an expression follows below.
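A hedged sketch of that autogeneration (nested_optional is an illustrative helper that produces an equivalent nested non-capturing form):

def nested_optional(n, w=r'(?:\w+\W+)'):
    """Build the nested form (?:w(?:w(?:w)?)?)? ... with n levels."""
    pattern = ''
    for _ in range(n):
        pattern = '(?:{0}{1})?'.format(w, pattern)
    return pattern

print(nested_optional(3))
# (?:(?:\w+\W+)(?:(?:\w+\W+)(?:(?:\w+\W+))?)?)?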
I personally think that using text.partition() is the best option, as it eliminates the messy regular expressions, and automatically leaves output in an easy-to-access tuple.

Need to match whole word completely from long set with no partials in Python

I have a function that - as a larger part of a different program - checks to see if a word entry is in a text file. So if the text file looks like this:
aardvark
aardvark's
aardvarks
abaci
.
.
.
zygotes
I just ran a quick if statement
infile = open("words","r") # Words is the file with all the words. . . yeah.
text = infile.read()
if word in text:
return 1
else:
return 0
Works, sort of. The problem is, while it returns true for aardvark and false for wj;ek, it will also return true for any SUBSTRING of any word. So, for example, rdva will come back as a 'word' because it IS in the file, as a substring of aardvark. I need it to match whole words only, and I've been quite stumped.
So how can I have it match an entire word (which is equivalent to an entire line, here) or nothing?
I apologize if this question is answered elsewhere, I searched before I posted!
Many Thanks!
Iterate over each line and see if the whole line matches:
def in_dictionary(word):
for line in open('words', 'r').readlines():
if word == line.strip():
return True
return False
When you use the in statement, you are basically asking whether the word is in the line.
Using == matches the whole line.
.strip() removes leading and trailing whitespace, which will cause hello to not equal {space}hello
There is a simpler approach. Your file is, conceptually, a list of words, so build that list of words (instead of a single string).
with open("words") as infile: words = infile.read().split()
return word in words
<string> in <string> does a substring search, but <anything> in <list> checks for membership. If you are going to check multiple times against the same list of words, then you may improve performance by instead storing a set of the words (just pass the list to the set constructor).
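For example, a minimal sketch of the repeated-lookup case with a set (is_word is an illustrative name):

with open("words") as infile:
    word_set = set(infile.read().split())  # build once

def is_word(word):
    return word in word_set  # O(1) average-case membership test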
Blender's answer works, but here is a different way that doesn't require you to iterate yourself:
Each line is going to end with a newline character (\n). So, what you can do is put a \n before and after your checked string when comparing. So something like this:
infile = open("words","r") # Words is the file with all the words. . . yeah.
text = "\n" + infile.read() # add a newline before the file contents so we can check the first line
if "\n"+word+"\n" in text:
return 1
else:
return 0
Watch out, though -- your line endings may be \r\n or just \r too.
It could also have problems if the word you're checking CONTAINS a newline. Blender's answer is better.
That's all great until you want to verify every word of a longer text against that list. For me, with /usr/share/dict/words, it takes up to 3 ms to check a single word against the list. So I suggest using a dictionary (no pun intended) instead. Lookups were about 2.5 thousand times faster with:
words = {}
for word in open('words', 'r').readlines():
    words[word.strip()] = True

def find(word):
    return word in words

Python: find regexp in a file

Have:
f = open(...)
r = re.compile(...)
Need:
Find the position (start and end) of the first regexp match in a big file
(starting from current_pos=...)
How can I do this?
I want to have this function:
def find_first_regex_in_file(f, regexp, start_pos=0):
    f.seek(start_pos)
    ...  # (searching f for regexp starting from start_pos) HOW?
    return [match_start, match_end]
File 'f' is expected to be big.
One way to search through big files is to use the mmap library to map the file into a big memory chunk. Then you can search through it without having to explicitly read it.
For example, something like:
import mmap
import os
import re

size = os.stat(fn).st_size
f = open(fn)
data = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
m = re.search(r"867-?5309", data)  # on Python 3 the pattern must be bytes, e.g. rb"867-?5309"
This works well for very big files (I've done it for a file 30+ GB in size, but you'll need a 64-bit OS if your file is more than a GB or two).
The following code works reasonably well with test files around 2GB in size.
def search_file(pattern, filename, offset=0):
    with open(filename) as f:
        f.seek(offset)
        for line in f:
            m = pattern.search(line)
            if m:
                search_offset = f.tell() - len(line) - 1
                return search_offset + m.start(), search_offset + m.end()
Note that the regular expression must not span multiple lines.
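A hedged usage sketch for search_file() above (the filename and pattern are illustrative):

import re

pat = re.compile(r"867-?5309")
span = search_file(pat, "huge.log", offset=0)
if span is not None:
    start, end = span
    print("first match spans offsets %d-%d" % (start, end))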
NOTE: this has been tested on python2.7. You may have to tweak things in python 3 to handle strings vs bytes but it shouldn't be too painful hopefully.
Memory-mapped files may not be ideal for your situation (32-bit mode increases the chance there isn't enough contiguous virtual memory, you can't read from pipes or other non-files, etc).
Here is a solution that reads 128 KB blocks at a time; as long as your regex matches a string smaller than that size, this will work. Also note that you are not restricted to single-line regexes. This solution runs plenty fast, although I suspect it will be marginally slower than using mmap. It probably depends more on what you're doing with the matches, as well as the size/complexity of the regex you're searching for.
The method makes sure to keep at most 2 blocks in memory. You might want to enforce at least 1 match per block as a sanity check in some use cases, but this method will truncate in order to keep at most 2 blocks in memory. It also makes sure that any regex match that runs to the end of the current block is NOT yielded; instead, the last position is saved for when either the true input is exhausted or we have another block that the regex matches before the end of, in order to better match patterns like "[^\n]+" or "xxx$". You may still be able to break things if you have a lookahead at the end of the regex, like xx(?!xyz) where yz is in the next block, but in most cases you can work around using such patterns.
import re

def regex_stream(regex, stream, block_size=128*1024):
    stream_read = stream.read
    finditer = regex.finditer
    block = stream_read(block_size)
    if not block:
        return
    lastpos = 0
    for mo in finditer(block):
        if mo.end() != len(block):
            yield mo
            lastpos = mo.end()
        else:
            break
    while True:
        new_buffer = stream_read(block_size)
        if not new_buffer:
            break
        if lastpos:
            size_to_append = len(block) - lastpos
            if size_to_append > block_size:
                block = '%s%s' % (block[-block_size:], new_buffer)
            else:
                block = '%s%s' % (block[lastpos:], new_buffer)
        else:
            size_to_append = len(block)
            if size_to_append > block_size:
                block = '%s%s' % (block[-block_size:], new_buffer)
            else:
                block = '%s%s' % (block, new_buffer)
        lastpos = 0
        for mo in finditer(block):
            if mo.end() != len(block):
                yield mo
                lastpos = mo.end()
            else:
                break
    if lastpos:
        block = block[lastpos:]
    for mo in finditer(block):
        yield mo
To test / explore, you can run this:
import cStringIO

# NOTE: you can substitute a real file stream here for t_in, but this serves as a test
t_in = cStringIO.StringIO('testing this is a 1regexxx\nanother 2regexx\nmore 3regexes')
block_size = len('testing this is a regex')
re_pattern = re.compile(r'\dregex+', re.DOTALL)
for match_obj in regex_stream(re_pattern, t_in, block_size=block_size):
    print 'found regex in block of len %s/%s: "%s[[[%s]]]%s"' % (
        len(match_obj.string),
        block_size, match_obj.string[:match_obj.start()].encode('string_escape'),
        match_obj.group(),
        match_obj.string[match_obj.end():].encode('string_escape'))
Here is the output:
found regex in block of len 46/23: "testing this is a [[[1regexxx]]]\nanother 2regexx\nmor"
found regex in block of len 46/23: "testing this is a 1regexxx\nanother [[[2regexx]]]\nmor"
found regex in block of len 14/23: "\nmore [[[3regex]]]es"
This can be useful in conjunction with quick-parsing a large XML file, where it can be split up into mini-DOMs based on a sub-element as root, instead of having to dive into handling callbacks and states when using a SAX parser. It also allows you to filter through XML faster. But I've used it for tons of other purposes as well. I'm kind of surprised recipes like this aren't more readily available on the net!
One more thing: parsing unicode should work as long as the passed-in stream produces unicode strings, and if you're using character classes like \w, you'll need to add the re.U flag to the re.compile pattern construction. In this case block_size actually means character count instead of byte count.
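A hedged sketch of that unicode case ('somefile.txt' is an illustrative filename; Python 2 style, matching the answer's code):

import io
import re

uni_pattern = re.compile(r'\w+', re.U)  # re.U so \w matches unicode word characters
with io.open('somefile.txt', encoding='utf-8') as fh:  # yields unicode strings
    for mo in regex_stream(uni_pattern, fh, block_size=64 * 1024):  # block_size counts characters here
        print mo.group().encode('utf-8')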
