findall/finditer on a stream? - python

Is there a way to get re.findall or, better yet, re.finditer functionality applied to a stream (i.e. a filehandle open for reading)?
Note that I am not assuming that the pattern to be matched is fully contained within one line of input (i.e. multi-line patterns are permitted). Nor am I assuming a maximum match length.
It is true that, at this level of generality, it is possible to specify a regex that would require that the regex engine have access to the entire string (e.g. r'(?sm).*'), and, of course, this means having to read the entire file into memory, but I am not concerned with this worst-case scenario at the moment. It is, after all, perfectly possible to write multi-line-matching regular expressions that would not require reading the entire file into memory.
Is it possible to access the underlying automaton (or whatever is used internally) from a compiled regex, to feed it a stream of characters?
Thanks!
Edit: Added clarifications regarding multi-line patterns and match lengths, in response to Tim Pietzcker's and rplnt's answers.

This is possible if you know that a regex match will never span a newline.
Then you can simply do
for line in file:
    result = re.finditer(regex, line)
    # do something...
If matches can extend over multiple lines, you need to read the entire file into memory. Otherwise, how would you know whether your match is already complete, whether some content further ahead would make a match impossible, or whether a match fails only because the file hasn't been read far enough?
Edit:
Theoretically it is possible to do this. The regex engine would have to check whether at any point during the match attempt it reaches the end of the currently read portion of the stream, and if it does, read on ahead (possibly until EOF). But the Python engine doesn't do this.
Edit 2:
I've taken a look at the Python stdlib's re.py and its related modules. The actual generation of a regex object, including its .match() method and others is done in a C extension. So you can't access and monkeypatch it to also handle streams, unless you edit the C sources directly and build your own Python version.

It would be possible to implement this for a regexp with a known maximum match length: either no +/* quantifiers, or only ones whose maximum number of repetitions you know. If you know this, you can read the file in chunks and match on those, yielding the results. You would also run the regexp on an overlapping chunk, to cover the case where a match would otherwise be cut off by the end of a chunk.
some pseudo(python)code:
overlap_tail = ''
matched = {}
for chunk in file.stream(chunk_size):
    # calculate chunk_start
    for result in finditer(match, overlap_tail + chunk):
        if chunk_start + result.start() not in matched:
            yield result
            matched[chunk_start + result.start()] = result
    # delete old results from dict
    overlap_tail = chunk[-max_re_len:]
Just an idea, but I hope you get what I'm trying to achieve. You'd also need to handle the stream ending, and some other edge cases, but I think it can be done as long as the maximum match length is limited (known).
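To make the idea concrete, here is a minimal runnable sketch along those lines (the helper name, chunk_size and max_re_len-style parameter are assumptions for illustration, not anything from the standard library; it also assumes the pattern never needs to see more than max_match_len characters at once):

import re

def finditer_stream(pattern, fobj, chunk_size=4096, max_match_len=256):
    # Hypothetical helper: yield matches from a file object read in chunks,
    # assuming no single match is longer than max_match_len characters.
    # Note: match positions are relative to the current window, not the file.
    regex = re.compile(pattern)
    window = ''
    while True:
        chunk = fobj.read(chunk_size)
        window += chunk
        if chunk == '':
            # EOF: nothing more can arrive, so every remaining match is final.
            for m in regex.finditer(window):
                yield m
            return
        # A match starting in the last max_match_len characters might still
        # grow once more data is read, so only trust starts before `safe`.
        safe = max(len(window) - max_match_len, 0)
        pos = 0
        for m in regex.finditer(window):
            if m.start() >= safe:
                break
            yield m
            pos = m.end()
        # Keep the untrusted tail (and anything past the last yielded match).
        window = window[max(pos, safe):]

The key invariant is that a match whose full possible extent already fits inside the window can be trusted, while anything starting near the end has to wait for the next chunk.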


How to find filenames with a specific extension using regex?

How can I grab 'dlc3.csv' & 'spongebob.csv' from the below string via the absolute quickest method - which I assume is regex?
4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv
I've already managed to achieve this by using split() and for loops, but it's slowing my program down way too much.
I would post an example of my current code but it's got a load of other stuff in it, so it would only cause you to ask more questions.
In a nutshell, I'm opening a large 6,000-line .csv file and using nested for loops to iterate through each line, with .split() to find specific parts of each line. I have many files where I need to scan for specific things on each line, and at the moment I've only implemented a couple of features into my Qt program, yet it's already taking up to 5 seconds to load some things and up to 10 seconds for others, all because of the nested loops. I've looked at where to use range and where not to, and where to use enumerate, and I use time.time() and logging.info() to measure the speed of each code change. After asking around, I've been told that using a regex is the best option for me, as it would remove the need for many of my for loops. The problem is I have no clue how to use regex. I of course plan on learning it, but if someone could help me out with this it would be much appreciated.
Thanks.
Edit: just to point out that when scanning each line, the filename is unknown; ".csv" is the only thing that isn't unknown. So I basically need the regex to grab every filename before .csv, but of course without grabbing the crap before the filename.
I'm currently looking for .csv using .split('/') & .split('|'), then checking if .csv is in the list index to grab the 'unknown' filename. Some lines will have only one filename whereas others will have two or more, so I need the regex to account for this too.
You can use this pattern: [^/]*\.csv
Breakdown:
[^/] - Any character that's not a forward slash
* - Zero or more of them
\. - A literal dot. (This is necessary because the dot is a special character in regex.)
For example:
import re
s = '''4918, fx,fx,weapon/muzzleflashes/fx_m1carbine,3.3,3.3,|sp/zombie_m1carbine|weapon|../zone_source/dlc3.csv|csv|../zone_source/spongebob.csv|csv'''
pattern = re.compile(r'[^/]*\.csv')
result = pattern.findall(s)
Result:
['dlc3.csv', 'spongebob.csv']
Note: It could just as easily be result = re.findall(r'[^/]*\.csv', s), but for code cleanliness, I prefer naming my regexes. You might consider giving it an even clearer name in your code, like pattern_csv_basename or something like that.
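Since you mention you're scanning a 6,000-line file line by line anyway, you could also drop the nested split() loops entirely and just run findall on each line; a rough sketch (the file name here is made up):

import re

pattern_csv_basename = re.compile(r'[^/]*\.csv')

with open('zone_source_list.csv') as fh:   # made-up file name
    for line_number, line in enumerate(fh, 1):
        names = pattern_csv_basename.findall(line)
        if names:
            print(line_number, names)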
Docs: re, including re.findall
See also: The official Python Regular Expression HOWTO

Think you know Python RE? Here's a challenge

Here's the skinny: how do you make a character set match NOT a previously captured character?
r'(.)[^\1]' # doesn't work
Here's the uh... fat? It's part of a (simple) cryptography program. Suppose "hobo" got coded to "fxgx". The program only gets the encoded text and has to figure what it could be, so it generates the pattern:
r'(.)(.)(.)\2' # 1st and 3rd letters *should* be different!
Now it (correctly) matches "hobo", but also matches "hoho" (think about it!). I've tried stuff like:
r'(.)([^\1])([^\1\2])\2' # also doesn't work
and MANY variations but alas! Alack...
Please help!
P.S. The work-around (which I had to implement) is to just retrieve the "hobo"s as well as the "hoho"s, and then just filter the results (discarding the "hoho"s), if you catch my drift ;)
P.P.S Now I want a hoho
VVVVV THE ANSWER VVVVV
Yes, I re-re-read the documentation and it does say:
Inside the '[' and ']' of a character class, all numeric escapes are
treated as characters.
As well as:
Special characters lose their special meaning inside sets.
Which pretty much means (I think) NO, you can't do anything like:
re.compile(r'(.)[\1]') # Well you can, but it kills the back-reference!
Thanks for the help!
1st and 3rd letters should be different!
This cannot be detected using a regular expression (not just Python's implementation). More specifically, it can't be detected by an automaton without memory; you'd have to use a different kind of automaton.
The kind of pattern you're trying to recognize (reduplication) is not regular. Moreover, it is not even context-free.
Finite automata are the mechanism that makes regular expression matching so efficient.
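As for the workaround mentioned in the question's P.S. (match loosely, then throw away the "hoho"s), a minimal sketch of that filtering step might look like this (the coded word, candidate list and helper are invented for illustration):

import re

coded = 'fxgx'                           # hypothetical coded word
pattern = re.compile(r'(.)(.)(.)\2')     # the same loose pattern as above

def is_consistent(m, coded=coded):
    # Letters that differ in the coded word must also differ in the candidate.
    groups = (m.group(1), m.group(2), m.group(3), m.group(2))
    return all((coded[i] == coded[j]) == (groups[i] == groups[j])
               for i in range(4) for j in range(4))

for candidate in ('hobo', 'hoho', 'dada'):
    m = pattern.fullmatch(candidate)
    if m and is_consistent(m):
        print(candidate, 'could be a decoding of', coded)   # only 'hobo' survives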

Pushing data into regex?

I am working on a small project which I call pydlp. It is basically a set of regex signatures that extract data from a file object, plus a function that checks whether the extracted data is in fact interesting.
This code is how I perform matching. It is far from optimal, as I have to read the file over and over again.
for signature in signatures:
    match = signature.validate(signature.regex.match(fobj.read()))
    if match: matches.append(match)
    fobj.seek(0)
Is there a way to perform multiple regex matches on the same file object while only reading the file object content once. The file object can be large, so I cannot put it in memory.
Edit:
I want to clarify what I mean by "pushing data into regex". I recognize that a regex has similarities with a finite state machine. Instead of passing the whole data to the regex engine at once, is it possible to push parts of it at a time?
while True:
    data = fobj.read(1024)
    if data == "": break
    for signature in signatures:
        match = signature.regex.push_and_match(data)
        if match: matches.append(match)
Edit 2:
Removed link, as I removed the project from github.
The standard way to do this sort of text processing with files too large to read into memory is to iterate over the file line by line:
regexes = [ .... ]
with open('large.file.txt') as fh:
    for line in fh:
        for rgx in regexes:
            m = rgx.search(line)
            if m:
                pass  # Do stuff.
But that approach assumes your regexes can operate successfully on single lines of text in isolation. If they cannot, perhaps there are other units that you can pass to the regexes (eg, paragraphs delimited by blank lines). In other words, you might need to do a bit of pre-parsing so that you can grab meaningful sections of text before sending them to your main regexes.
with open('large.file.txt') as fh:
    section = []
    for line in fh:
        if line.strip():
            section.append(line)
        else:
            # We've hit the end of a section, so we
            # should check it against our regexes.
            process_section(''.join(section), regexes)
            section = []
    # Don't forget the last one.
    if section:
        process_section(''.join(section), regexes)
Regarding your literal question: "Is there a way to perform multiple regex matches on the same file object while only reading the file object content once". No and yes. No in the sense that Python regexes operate on strings, not file objects. But you can perform multiple regex searches at the same time on one string, simply by using alternation. Here's a minimal example:
import re

patterns = 'aa bb cc'.split()
big_regex = re.compile('|'.join(patterns))  # Match this or that or that.
m = big_regex.search(some_text)
But that doesn't really solve your problem if the file is too big for memory.
Maybe consider using re.findall() if you don't need match objects but only the matched strings? If the file is too big you can slice it into parts, as you suggest, but use some overlap so that no matches are missed (if you know the nature of the regexes, it may be possible to work out how big the overlap needs to be).

How do I use re.search starting from a certain index in the string?

Seems like a simple thing but I'm not seeing it. How do I start the search in the middle of a string?
The re.search function doesn't take a start argument the way the str methods do. But the search method of a compiled pattern (what re.compile returns) does take a pos argument.
This makes sense if you think about it. If you really need to use the same regular expressions over and over, you probably should be compiling them. Not so much for efficiency—the cache works nicely for most applications—but just for readability.
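For example (a throwaway pattern, just to show the pos argument in action):

import re

s = 'abcdefabcdef'
pat = re.compile('abc')

print(pat.search(s).start())     # 0 - searches from the beginning
print(pat.search(s, 3).start())  # 6 - starts looking at index 3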
But what if you need to use the top-level function, because you can't pre-compile your patterns for some reason?
Well, there are plenty of third-party regular expression libraries. Some of these wrap PCRE or Google's RE2 or ICU, some implement regular expressions from scratch, and they all have at least slightly different, sometimes radically different, APIs.
But the regex module, which is being designed as an eventual replacement for re in the stdlib (although it's been bumped a couple of times now because it's not quite ready), is pretty much usable as a drop-in replacement for re, and (among other extensions) it takes pos and endpos arguments on its search function.
Normally, the most common reason you'd want to do this is to "find the next match after the one I just found", and there's a much easier way to do that: use finditer instead of search.
For example, this str-method loop:
i = 0
while True:
    i = s.find(sub, i)
    if i == -1:
        break
    do_stuff_with(s, i)
    i += 1  # advance past this match so the loop makes progress
… translates to this much nicer regex loop:
for match in re.finditer(pattern, s):
    do_stuff_with(match)
When that isn't appropriate, you can always slice the string:
match = re.search(pattern, s[index:])
But that makes an extra copy of half your string, which could be a problem if the string is actually, say, a 12GB mmap. (Of course for the 12GB mmap case, you'd probably want to map a new window… but there are cases where that won't help.)
Finally, you can always just modify your pattern to skip over index characters:
match = re.search('.{%d}%s' % (index, pattern), s)
All I've done here is to add, e.g., .{20} to the start of the pattern, which means to match exactly 20 of any character, plus whatever else you were trying to match. Here's a simple example:
.{3}(abc)
Debuggex Demo
If I give this abcdefabcdef, it will match the first 'abc' after the 3rd character—that is, the second abc.
But notice that what it actually matches is 'defabc'. Because I'm using capture groups for my real pattern, and I'm not putting the .{3} in a group, match.group(1) and so on will work exactly as I'd want them to, but match.group(0) will give me the wrong thing. If that matters, you need a lookbehind.
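Here is what that group behavior looks like with the toy pattern above:

import re

m = re.search(r'.{3}(abc)', 'abcdefabcdef')
print(m.group(0))   # 'defabc' - includes the 3 skipped characters
print(m.group(1))   # 'abc'    - the part we actually wanted
print(m.start(1))   # 6        - where the real match begins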

re.findall regex hangs or very slow

My input file is a large txt file with concatenated texts I got from an open text library. I am now trying to extract only the content of the book itself and filter out other stuff such as disclaimers etc. So I have around 100 documents in my large text file (around 50 mb).
I then have identified the start and end markers of the contents themselves, and decided to use a Python regex to find me everything between the start and end marker. To sum it up, the regex should look for the start marker, then match everything after it, and stop looking once the end marker is reached, then repeat these steps until the end of the file is reached.
The following code works flawlessly when I feed a small, 100kb sized file into it:
import codecs
import re
outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()
When I use this regex operation on my larger file however, it will not do anything, the program just hangs. I tested it overnight to see if it was just slow and even after around 8 hours the program was still stuck.
I am very sure that the source of the problem is the
(.*?)
part of the regex, in combination with re.DOTALL.
When I use a similar regex on smaller distances, the script will run fine and fast.
My question now is: why is this just freezing up everything? I know the texts between the delimiters are not small, but a 50mb file shouldn't be too much to handle, right?
Am I maybe missing a more efficient solution?
Thanks in advance.
You are correct in thinking that using the sequence .*, which appears more than once, is causing problems. The issue is that the solver is trying many possible combinations of .*, leading to a result known as catastrophic backtracking.
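If you want to see the blow-up in isolation, here is the classic toy case of catastrophic backtracking (nested quantifiers on input that almost matches); it is not the asker's exact pattern, just an illustration of the effect:

import re
import time

evil = re.compile(r'(a+)+b')          # nested quantifiers
for n in (18, 21, 24):
    s = 'a' * n + 'X'                 # no 'b', so the match must fail
    t0 = time.time()
    evil.match(s)
    print(n, "a's:", round(time.time() - t0, 3), 'seconds')
# Each additional 'a' roughly doubles the running time, which is why a
# 50 MB input appears to hang rather than merely run slowly.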
The usual solution is to replace the . with a character class that is much more specific, usually the production that you are trying to terminate the first .* with. Something like:
`[^\n]*(.*)`
so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular-expression solution may not be the best approach, and to use either a context-free parser (such as pyparsing), or to first break the input into smaller, easier-to-digest chunks (for example, with corpus.split('\n')).
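For instance, one way to apply the "smaller chunks" idea here is to scan the file line by line with a simple inside/outside flag, so no single regex ever has to span megabytes of text (the marker patterns below are simplified guesses based on the question, and the file names are taken from it):

import re

start_re = re.compile(r'START\s+OF\s+THE\s+PROJECT\s+GUTENBERG\s+EBOOK')
end_re = re.compile(r'END\s+OF\s+THE\s+PROJECT\s+GUTENBERG\s+EBOOK')

inside = False
with open('infile.txt', encoding='utf-8-sig') as infile, \
     open('outfile.txt', 'w', encoding='utf-8-sig') as outfile:
    for line in infile:
        if start_re.search(line):
            inside = True        # start copying from the next line
            continue
        if end_re.search(line):
            inside = False       # stop copying at the end marker
            continue
        if inside:
            outfile.write(line)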
Another workaround to this issue is adding a sane limit to the number of matched characters.
So instead of something like this:
[abc]*.*[def]*
You can limit it to 1-100 instances per character group.
[abc]{1,100}.{1,100}[def]{1,100}
This won't work for every situation, but in some cases it's an acceptable quickfix.
