Performing incremental regex searches in huge strings (Python)

Using Python 2.6.6.
I was hoping that the re module provided some method of searching that mimicked the way str.find() works, allowing you to specify a start index, but apparently not...
search() lets me find the first match...
findall() will return all (non-overlapping!) matches of a single pattern
finditer() is like findall(), but via an iterator (more efficient)
Here is the situation... I'm data mining in huge blocks of data. For parts of the parsing, regex works great. But once I find certain matches, I need to switch to a different pattern, or even use more specialized parsing to find where to start searching next. If re.search allowed me to specify a starting index, it would be perfect. But in absence of that, I'm looking at:
Using finditer(), but skipping forward until I reach an index that is past where I want to resume using re. Potential problems:
If the embedded binary data happens to contain a match that overlaps a legitimate match just after the binary chunk...
Since I'm not searching for a single pattern, I'd have to juggle multiple iterators, which also has the possibility of a false match hiding the real one.
Slicing, i.e., creating a copy of the remainder of the data each time I want to search again.
This would be robust, but would force a lot of "needless" copying on data that could be many megabytes.
I'd prefer to keep it so that all match locations were indexes into the single original string object, since I may hang onto them for a while and want to compare them. Finding subsequent matches within separate sliced-off copies is a bookkeeping hassle.
Just occurred to me that I may be able to use a "rotating buffer" sort of approach, but haven't thought it through completely. That might introduce a lot of complexity to the code.
Am I missing any obvious alternatives? Not sure if there would be a way to wrap a huge string with a class that would serve slices... Or a slicing sort of iterator or "string cursor" idiom?

Use a two-pass approach. The first pass uses the first regex to find the "interesting bits" and outputs those offsets into a separate file. You didn't say if you can tell where the "end" of each interesting segment is, but you'd include that too if available. The second pass uses the offsets to load sections of the file as independent strings and then applies whatever secondary regex you like on each smaller string.
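A minimal sketch of that two-pass idea, assuming the whole blob is already in memory as a single string; INTERESTING and SECONDARY are made-up placeholder patterns, not anything from the question:
import re

# Hypothetical patterns -- substitute the real ones.
INTERESTING = re.compile(r"HDR:[0-9A-F]{8}")   # pass 1: finds the "interesting bits"
SECONDARY = re.compile(r"field=(\w+)")         # pass 2: run within each segment

def two_pass_search(data):
    # data is the single huge string; every offset returned is an index into it
    offsets = [m.start() for m in INTERESTING.finditer(data)]

    results = []
    # Each segment runs from one interesting offset to the next (or to the end).
    for start, end in zip(offsets, offsets[1:] + [len(data)]):
        for m in SECONDARY.finditer(data[start:end]):
            # start + m.start() maps the match back into the original string
            results.append((start + m.start(), m.group(1)))
    return results
Only one segment is sliced at a time, so the copying is bounded by the largest segment rather than the whole multi-megabyte string, and every reported position is still an index into the original string.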

Related

Optimizing regular expression techniques

I am wondering about optimization techniques for regex
So I am trying to parse every instance of money out of a 400k-line corpus. I need to match amounts such as "$10,999.04" as well as "one billion and six hundred twenty five thousand dollars" and everything in between. This required a very lengthy regular expression with multiple instances of groups like
MONEYEXPRESSION = '(?:\d\d?\d?(?:,?\d{3})*(?:\.\d+)?)'
(one|two|...|ninety[\s-]?nine|hundred|a hundred|MONEYEXPRESSION)((\s*and\s*|\s*-\s*|\s*)(one|two|...|ninety[\s-]?nine|hundred|a hundred|MONEYEXPRESSION))*
Even more than that, in order to require it to be an instance of money and avoid matching lines such as "five hundred people were at the event", I have four OR'd options that require "$", "dollars?", or "cents?" to appear in specific places in the sentence at least once.
The regular expression is almost 20k characters! :(
You can imagine that with an expression this extensive, any bad practice really adds to the running time. I have been running this on the corpus for the past 2 hours and it has still not finished matching. I was wondering what the best practices are for optimizing and trimming unnecessary regex, which of the operations I am using are expensive and could be replaced with cheaper ones, and whether there is simply a better way to solve this problem.
You're asking about optimizing performance, so let's focus on that. What makes the regexp engine really slow is backtracking, and what causes backtracking is parts of the pattern that might succeed at different places in the string, with no clear way to decide between them. So try these rules of thumb:
From the backtracking link above: "When nesting repetition operators, make absolutely sure that there is only one way to match the same match."
Avoid large optional components. Instead of something like (<number>? <number>)? <number> to match a sequence with space-separated elements, write (<number> ?)+.
Avoid components that can be satisfied by the empty string-- the engine will be trying to satisfy them at every position.
Ensure that unconstrained parts in your regexps are bounded in length, especially if the later part cannot be reliably recognized. Things like A.*B? are asking for trouble-- this can match anything that starts with A.
Don't use lookahead/lookbehind. Almost always there's a simpler way.
More generally, keep it simple. I'm not sure how you managed to get to 20K characters for this task, but I'll bet there are ways to simplify it. One consideration is that it's ok to match things that you will not see anyway.
For example, why match all numbers from one to ninety-nine, and not just their components? Yeah, you'll match nonsense like "nine ninety and dollars", but that does no harm. You are searching for money amounts, not validating input. E.g., this should match all written-out dollar amounts less than a million dollars:
((one|two|three|...|twenty|thirty|...|ninety|hundred|thousand|and) ?)+ (dollars?|euros?)\b
Since this is tagged "python", here are two more suggestions:
If the task (or assignment) allows you to break up the search in steps, do so.
A regexp that does everything often has to be so complex that it is slower than simply running several searches in sequence.
Even if you are constrained to use one monster regexp, write it in pieces and use Python to assemble it into one string. It won't make any difference at execution time, but it will be a lot easier to work with.
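A small sketch of that assembly idea, using made-up, deliberately short word lists; the point is only that each piece stays readable on its own and Python glues them together once before compiling:
import re

# Illustrative fragments only -- the real word lists are much longer.
UNITS = "one|two|three|four|five|six|seven|eight|nine"
TENS = "twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety"
SCALES = "hundred|thousand|million|billion"
NUMBER_WORD = "(?:%s|%s|%s|and)" % (UNITS, TENS, SCALES)

MONEYEXPRESSION = r"(?:\d\d?\d?(?:,?\d{3})*(?:\.\d+)?)"  # from the question

WRITTEN_AMOUNT = r"(?:%s[\s-]?)+(?:dollars?|cents?)\b" % NUMBER_WORD
NUMERIC_AMOUNT = r"\$%s|%s\s+(?:dollars?|cents?)\b" % (MONEYEXPRESSION, MONEYEXPRESSION)

# The pieces stay readable on their own; Python joins them once.
MONEY_RE = re.compile("|".join([NUMERIC_AMOUNT, WRITTEN_AMOUNT]), re.IGNORECASE)
Once assembled, MONEY_RE behaves exactly as if the whole pattern had been typed out by hand; only the source stays readable.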
I would try something of this sort: pre-filter the corpus down to only the lines that can possibly contain money, and run the heavy regex on that much smaller file afterwards.
keywords = ["$", "dollar", "dollars", "cent", "cents"]
my_file = r"c:\file.txt"
output = r"c:\output.txt"

filtered_lines = []
with open(my_file, "r") as f:
    for line in f:
        # keep any line that mentions one of the money keywords
        for k in keywords:
            if k in line:
                filtered_lines.append(line.rstrip("\n"))
                break

with open(output, "w") as o:
    o.write("\n".join(filtered_lines))
I'd keep the numerical and the written-out regex apart and do it in two steps, first extract the numerical amounts (which is the easy part) and then do the written-out amounts.
The most problematic thing with the written-out part is that if you have "one hundred people" it will try out all the billions and thousands and everything already at the word "one", just to find out in the end that there are no dollars. Even worse, it will then try everything again for the word "hundred", and then for "people"...
Ideally it would therefore start from the back, so that it doesn't try the expensive part at every single word but only at 'dollars', 'cents' and the like, and only then does the rest.
Therefore, if possible, try to match your file back to front for the written-out stuff. It will definitely get quite hard to wrap your head around that but I bet it will be significantly faster.
And if it is not possible, I hope now you at least know where the main bottleneck is.
Ah, and some word boundaries might also help, reducing the matching attempts from every character to every start of a word. I didn't mention it above, but for this example the engine actually starts trying to match at 'o', then again at 'n', 'e', ' ' and so on.
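If reversing the text itself is too awkward, one way to approximate the back-to-front idea is to find the cheap keyword first and anchor the expensive written-out pattern so it must end right where the keyword begins. A rough sketch, with a made-up, much smaller word list and an assumed 80-character window:
import re

# Cheap part: where could money possibly be?
MONEY_KEYWORD = re.compile(r"\$|\bdollars?\b|\bcents?\b", re.IGNORECASE)

# Expensive part: the written-out number, anchored with $ so it must END
# right where the keyword begins.  The word list here is a tiny stand-in.
NUMBER_WORDS = re.compile(
    r"(?:\b(?:one|two|three|four|five|six|seven|eight|nine|ten"
    r"|twenty|thirty|hundred|thousand|million|and)[\s-]*)+$",
    re.IGNORECASE)

def written_out_amounts(line, window=80):
    # Find the cheap keyword first, then run the expensive pattern only on
    # the window of text immediately before it (80 chars is an assumed bound).
    results = []
    for kw in MONEY_KEYWORD.finditer(line):
        before = line[max(0, kw.start() - window):kw.start()]
        m = NUMBER_WORDS.search(before)
        if m:
            results.append(m.group(0) + kw.group(0))
    return results
Because the expensive pattern only ever runs on a short window ending at a keyword, a line like "five hundred people were at the event" is rejected after the cheap keyword scan alone.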

Python Regular Expressions - Ignore sequence during searching

I am trying to make a kind of data miner with Python. What I am examining is a dictionary of the Greek language. The dictionary was originally in PDF format, and I turned it into a roughly corresponding HTML format to parse it more easily. I have done some further formatting on it, since the data structure was heavily distorted.
My current task is to find and separately store the individual words, along with their descriptions. So the first thought that came to mind was to identify the words first, apart from their descriptions. The header of each word's entry has a very specific syntax, and I use that to create a corresponding regular expression to match each and every one of them.
There is one problem though. Despite the formatting I have done to the HTML so far, there are still many points where a run of logically continuous data is interrupted by the sequence < /br> followed by a newline, at seemingly random positions. Is there any way to direct my regular expression to "ignore" that sequence, that is, to treat it as non-existent when met, and therefore include those matches which are interrupted by it?
That is, without putting a (< br/>\n)? in every part of my RE to cover every possible case.
The regular expression I use is the following:
(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>
and does a fine job with the matching, when the data is not interrupted by the sequence given above.
The problem, in case it is not clear, is that the interrupting sequence can occur anywhere within a match. I am therefore looking for a way other than covering every single spot where the sequence might occur; I want the sequence simply ignored when deciding whether to return a match, as I explained earlier.
What you're asking for is a different regular expression.
The new regular expression would be the old one, with (<br\s*?/>\n?)? or the like after every non-quantifier character.
You could write something to transmute a regular expression into the form you're looking for. It would take in your existing regex and produce a br-tolerant regex. No construct in the regular expression grammar exists to do this for you automatically.
I think the easier thing to do is to transform the source document so that it no longer contains the sequences you wish to ignore. This should be an easy text substitution.
If it weren't for your explicit use of the <b> tags for meaning, an alternative would be to just take the plain-text document content instead of the HTML content.
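The substitution mentioned above is a one-liner; a sketch, with a pattern that is deliberately loose about spacing and slash position so it catches both the < /br> and < br/> spellings seen in the question:
import re

# Strip every break-tag-plus-optional-newline before running the real regex.
BR = re.compile(r"<\s*/?\s*br\s*/?\s*>\n?")

def clean(html):
    return BR.sub("", html)
Offsets into the cleaned string no longer correspond to the original HTML, which should be acceptable here since the goal is to extract words and descriptions rather than positions.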

Ignoring whitespace in a python diff

Is there an elegant way to ignore whitespace in a diff in python (using difflib, or any other module)? Maybe I missed something, but I've scoured the documentation, and was unable to find any explicit support for this in difflib.
My current solution is to just break my text into lists of words, and then diff those:
d.compare(("".join(text1_lines)).split(), ("".join(text2_lines)).split())
The disadvantage of this is that if one wants a report of line-by-line differences, rather than word-by-word, one must merge the output of the diff with the original file text. This is easily doable, but a bit inconvenient.
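For reference, the word-level workaround described above looks roughly like this when made self-contained (a sketch; it assumes text1_lines and text2_lines are lists of lines that still carry their trailing newlines):
import difflib

def word_diff(text1_lines, text2_lines):
    # "".join assumes each line still ends with its newline; otherwise
    # join with " " so words at line boundaries don't get glued together.
    words1 = "".join(text1_lines).split()
    words2 = "".join(text2_lines).split()
    d = difflib.Differ()
    return list(d.compare(words1, words2))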

Remove all replicas of a string more than x characters long (regex?)

I'm not certain that regex is the best approach for this, but it seems fairly well suited. Essentially I'm currently parsing some PDFs using pdfminer, and the drawback is that these PDFs are exported PowerPoint slides, which means that all animations show up as fairly long repeated copies of strings. Ideally I would like just one copy of each of these strings instead of a copy for each stage of an animation. Right now the regex pattern I'm using is this:
re.sub(r"([\w^\w]{10,})\1{1,}", "\1", string)
For some reason though, this doesn't seem to change the input string. I feel like Python isn't recognizing the capture group, but I'm not sure how to remedy that issue. Any thoughts appreciated.
Examples:
I would like this
text to be
reduced
I would like this
text to be
reduced
output:
I would like this
text to be
reduced
Update:
To get this to pass the pumping lemma I had to specifically make the assertion that all duplicates were adjacent. This was implied before, but I am now making it explicit to ensure that a solution is possible.
Regexps are not the right tool for that task. Formally they are based on the theory of regular languages, which cannot detect that a string contains duplicates, let alone remove them. You may find a course on automata and regexps interesting to read on the topic.
I think Josay's suggestion can be efficient and smart, but I think I have a simpler and more Pythonic solution, though it has its limits. You can split your string into a list of lines and pass it through set():
>>> s = """I would like this
... text to be
...
... reduced
... I would like this
... text to be
...
... reduced"""
>>> print "\n".join(set(s.splitlines()))
I would like this
text to be
reduced
>>>
The only thing with that solution is that you will lose the original order of the lines (the example above being a pretty good counter-example). Also, if you have the same line in two different contexts, you will end up with only one copy of that line.
To fix the first problem, you can iterate over the original string a second time and put the set back in order, or simply use an ordered set.
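The "put it back in order" pass amounts to keeping only the first occurrence of each line, e.g. (a sketch):
def dedup_lines(s):
    seen = set()
    out = []
    for line in s.splitlines():
        if line not in seen:   # keep only the first occurrence of each line
            seen.add(line)
            out.append(line)
    return "\n".join(out)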
If you have some symbol separating each slide, it would help you merge only the duplicates, fixing the second problem with that solution.
Otherwise a more sophisticated algorithm would be needed, so you can take proximity and context into account. For that a suffix tree could be a good idea, and there are Python libraries for that (cf. that SO answer).
edit:
Using your algorithm I could make it work, by adding multiline support and allowing spaces and newlines in the matched text:
>>> re.match(r"([\w \n]+)\n\1", string, re.MULTILINE).groups()
('I would like this\ntext to be\n\nreduced',)
Though, AFAICT the \1 notation in the matching part is not regular-expression syntax in the formal sense, but an extension (a backreference). But it's getting late here, and I may well be totally wrong. Maybe I should reread those courses? :-)
I guess the regexp engine's pushdown automaton is able to push the captured match, since here it is just one long multiline string that it can later pop to match against. Though I'd expect that to have side effects...
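For what it's worth, the original re.sub has two separate problems: the character class [\w^\w] cannot match the spaces and newlines inside the repeated block, and the replacement "\1" in a non-raw string is actually the control character \x01. A substitution along the same lines as the match above, offered as a sketch rather than a general solution:
import re

def collapse_repeats(text, min_len=10):
    # (?s) lets "." cross newlines; the lazy {N,}? keeps the captured block as
    # short as possible, and (?:\n?\1)+ allows the copies to be separated by a
    # newline, as in the example above.
    pattern = r"(?s)(.{%d,}?)(?:\n?\1)+" % min_len
    return re.sub(pattern, r"\1", text)
Lazy dot-all matching with a backreference can backtrack heavily on large inputs, so this is only reasonable for text the size of a slide deck.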

fault-tolerant python based parser for WikiLeaks cables

Some time ago I started writing a BNF-based grammar for the cables which WikiLeaks released. However, I have now realized that my approach is maybe not the best and I'm looking for some improvement.
A cable consists of three parts. The head has some RFC 2822-style format, which usually parses correctly. The text part has a more informal specification. For instance, there is a REF line. This should start with REF:, but I found different versions. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there are spaces in front, different cases and so on. The text part is mostly plain text with some specially formatted headings.
We want to recognize each field (date, REF etc.) and put it into a database. We chose Python's SimpleParse. At the moment the parser stops at each field it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields appear in some kind of order. When the parser doesn't recognize a field, it should attach a 'not recognized' blob to the current field and go on. (Or maybe you have a better approach here.)
What kind of parser or other kind of solution would you suggest? Is something better around?
Cablemap seems to do what you're searching for: http://pypi.python.org/pypi/cablemap.core/
I haven't looked at the cables, but let's take a similar problem and consider the options: say you wanted to write a parser for RFCs. There's an RFC for the formatting of RFCs, but not all RFCs follow it.
If you write a strict parser, you'll run into the situation you've already run into: the outliers will halt your progress. In that case you've got two options:
Split them into two groups, the ones that are strictly formatted and the ones that aren't. Write your strict parser so that it gets the low-hanging fruit, and figure out, based on the number of outliers, what the best options are (hand processing, an outlier parser, etc.).
If the two groups are roughly equal in size, or there are more outliers than standard formats, write a flexible parser. In this case regular expressions are going to be more beneficial to you, since you can process an entire file against a series of flexible regexes; if one of the regexes fails, you can easily generate the outlier list. And since you run the search against a series of regexes, you can build a matrix of passes and fails for each regex (see the sketch below).
For 'fuzzy' data where some records follow the format and some do not, I much prefer the regex approach. That's just me, though. (Yes, it is slower, but having to engineer the relationship between each match segment so that you have a single query (or parser) that fits every corner case is a nightmare when dealing with human-generated input.)
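A rough sketch of that regex-matrix idea; the field names and patterns below are made-up stand-ins, not the actual cable format:
import re

# Illustrative field patterns only -- the real cable fields would go here.
FIELD_PATTERNS = [
    ("date",    re.compile(r"^\s*DATE\s*:\s*(.+)$", re.MULTILINE)),
    ("ref",     re.compile(r"^\s*[Rr][Ee][Ff][Ss: ]\s*(.+)$", re.MULTILINE)),
    ("subject", re.compile(r"^\s*SUBJECT\s*:\s*(.+)$", re.MULTILINE)),
]

def parse_cable(text):
    # Apply every field regex independently; an unmatched field is recorded
    # as None instead of halting the parse.
    record = {}
    for name, pattern in FIELD_PATTERNS:
        m = pattern.search(text)
        record[name] = m.group(1).strip() if m else None
    return record

def outliers(cables):
    # Cables where at least one field failed -- the pass/fail matrix reduced
    # to a list of candidates for hand processing or an outlier parser.
    return [c for c in cables if None in parse_cable(c).values()]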
