Some time ago I started writing a BNF-based grammar for the cables which WikiLeaks released. However, I now realize that my approach may not be the best, and I'm looking for ways to improve it.
A cable consists of three parts. The head has some RFC 2822-style format, which usually parses correctly. The text part has a more informal specification. For instance, there is a REF line: it should start with REF:, but I have found different versions. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there are leading spaces, different cases and so on. The text part is mostly plain text with some specially formatted headings.
We want to recognize each field (date, REF etc.) and put it into a database. We chose Python's SimpleParse. At the moment the parser stops at each field it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields appear in some kind of order. When the parser doesn't recognize a field, it should add a 'not recognized' blob to the current field and move on. (Or maybe you have a better approach here.)
What kind of parser or other solution would you suggest? Is there something better around?
Cablemap seems to do what you're searching for: http://pypi.python.org/pypi/cablemap.core/
I haven't looked at the cables, but let's take a similar problem and consider the options. Let's say you wanted to write a parser for RFCs: there's an RFC for the formatting of RFCs, but not all RFCs follow it.
If you write a strict parser, you'll run into the situation you've already run into - the outliers will halt your progress. In this case you've got two options:
Split them into two groups: the ones that are strictly formatted and the ones that aren't. Write your strict parser so that it gets the low-hanging fruit, and figure out from the number of outliers what the best option is (hand processing, an outlier parser, etc.).
If the two groups are roughly equal in size, or there are more outliers than standard formats, write a flexible parser. In this case regular expressions are going to be more beneficial to you, as you can process an entire file looking for a series of flexible regexes; if one of the regexes fails, you can easily generate the outlier list. And since you run the search against a series of regexes, you can build a matrix of passes/fails for each regex.
For 'fuzzy' data where some entries follow the format and some do not, I much prefer the regex approach. That's just me, though. (Yes, it is slower, but having to engineer the relationship between each match segment so that you have a single query (or parser) that fits every corner case is a nightmare when dealing with human-generated input.)
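A rough sketch of the flexible-regex matrix idea, reusing the REF pattern from the question; the DATE pattern and the field list here are illustrative guesses, not the actual cable format:

import re

field_patterns = {
    'ref':  re.compile(r'^\s*[Rr][Ee][Ff][Ss: ]', re.MULTILINE),
    'date': re.compile(r'^\s*[Dd][Aa][Tt][Ee]\s*:', re.MULTILINE),  # guessed
}

def pass_fail_row(cable_text):
    # one row of the pass/fail matrix: which fields were recognised
    return dict((name, bool(pat.search(cable_text)))
                for name, pat in field_patterns.items())

# cables whose row contains a False go on the outlier list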
I want to pull abstracts out of a large corpus of scientific papers using a Python script. The papers are all saved as strings in a large CSV. I want to do something like this: extracting text between two headers. I can write a regex to find the 'Abstract' heading. However, finding the next section heading is proving difficult. Headers vary wildly from paper to paper. They can be ALL CAPS or Just Capitalized. They can be one word or a long phrase that spans two lines. They are usually followed by one or two newlines. This is what I came up with:
abst = re.findall(r'(?:ABSTRACT\s*\n+|Abstract\s*\n+)(.*?)((?:[A-Z]+|(?:\n(?:[A-Z]+|(?:[A-Z][a-z]+\s*)+)\n+)))', row[0], re.DOTALL)
Here is an example of an abstract:
'...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for
sequential\ndata but they do not scale well with long sequences. We
propose a scalable inference and learning algorithm for FHMMs that
draws on ideas from the stochastic\nvariational inference, neural
network and copula literatures. Unlike existing approaches, the
proposed algorithm requires no message passing procedure among\nlatent
variables and can be distributed to a network of computers to speed up
learning. Our experiments corroborate that the proposed algorithm does
not introduce\nfurther approximation bias compared to the proven
structured mean-field algorithm,\nand achieves better performance with
long sequences and large FHMMs.\n\n1\n\nIntroduction\n\n...'
So I'm trying to find 'Abstract' and 'Introduction' and pull out the text that is between them. However, it could be 'ABSTRACT' and 'INTRODUCTION', or 'ABSTRACT' and 'A SINGLE LAYER NETWORK AND THE MEAN FIELD\nAPPROXIMATION\n'.
Help?
Recognizing the next section is a bit vague - perhaps we can rely on the Abstract section ending with two newlines?
ABSTRACT\n(.*)\n\n
Or maybe we'll just assume that the next section title will start with an uppercase letter and be followed by any number of word characters. (That's rather vague too, and assumes there'll be no \n\n within the Abstract.)
ABSTRACT\n(.*)\n\n\U[\w\s]*\n\n
Maybe that stimulates further fiddling on your end... Feel free to post examples where this did not match - maybe we can refine it stepwise.
N.B.: as Wiktor pointed out, I could not use the case-insensitive modifiers inline, so the whole regex should be used with the switch for case-insensitive matching.
Update 1: the challenge here is really how to identify that a new section has begun... and not to confuse that with paragraph breaks within the Abstract. Perhaps that can be dealt with by changing the rather tolerant [\w\s]* to [\w\s]{1,100}, which would only recognize text in a new paragraph as the title of the "abstract-successor" if it had at least 2 characters (the quantifier's lower limit is set to 1 because the leading \U (uppercase character) already contributes one).
ABSTRACT\n(.*)\n\n\U[\w\s]{1,100}\n\n
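A minimal sketch of that refined pattern, assuming the abstract really ends at a short uppercase-initial heading. Python's re has no \U shorthand for an uppercase letter, so an explicit [A-Z] stands in for it, and the heading variants are spelled out rather than using a global IGNORECASE flag (which would also down-case the [A-Z] test):

import re

pattern = re.compile(
    r'(?:ABSTRACT|Abstract)\s*\n+(.*?)\n\n[A-Z][\w\s]{1,100}\n\n',
    re.DOTALL)

def extract_abstract(text):
    # paragraph breaks inside the abstract can still confuse this,
    # as discussed above
    m = pattern.search(text)
    return m.group(1) if m else None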
Using Python 2.6.6.
I was hoping that the re module provided some method of searching that mimicked the way str.find() works, allowing you to specify a start index, but apparently not...
search() lets me find the first match...
findall() will return all (non-overlapping!) matches of a single pattern
finditer() is like findall(), but via an iterator (more efficient)
Here is the situation... I'm data mining huge blocks of data. For parts of the parsing, regex works great. But once I find certain matches, I need to switch to a different pattern, or even use more specialized parsing to find where to start searching next. If re.search allowed me to specify a starting index, it would be perfect. But in the absence of that, I'm looking at:
Using finditer(), but skipping forward until I reach an index that is past where I want to resume using re. Potential problems:
If the embedded binary data happens to contain a match that overlaps a legitimate match just after the binary chunk...
Since I'm not searching for a single pattern, I'd have to juggle multiple iterators, which also has the possibility of a false match hiding the real one.
Slicing, i.e., creating a copy of the remainder of the data each time I want to search again.
This would be robust, but would force a lot of "needless" copying on data that could be many megabytes.
I'd prefer to keep it so that all match locations were indexes into the single original string object, since I may hang onto them for a while and want to compare them. Finding subsequent matches within separate sliced-off copies is a bookkeeping hassle.
Just occurred to me that I may be able to use a "rotating buffer" sort of approach, but haven't thought it through completely. That might introduce a lot of complexity to the code.
Am I missing any obvious alternatives? Not sure if there would be a way to wrap a huge string with a class that would serve slices... Or a slicing sort of iterator or "string cursor" idiom?
Use a two-pass approach. The first pass uses the first regex to find the "interesting bits" and outputs those offsets into a separate file. You didn't say if you can tell where the "end" of each interesting segment is, but you'd include that too if available. The second pass uses the offsets to load sections of the file as independent strings and then applies whatever secondary regex you like on each smaller string.
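A rough sketch of the two-pass idea; the two patterns and the "start end" offsets-file format here are placeholders, not something from the question:

import re

FIRST_PASS = re.compile(r'interesting')    # placeholder pattern
SECOND_PASS = re.compile(r'detail')        # placeholder pattern

def pass_one(path, offsets_path):
    # write the offsets of the interesting bits to a side file
    data = open(path, 'rb').read()
    with open(offsets_path, 'w') as out:
        for m in FIRST_PASS.finditer(data):
            out.write('%d %d\n' % (m.start(), m.end()))

def pass_two(path, offsets_path):
    # load each interesting section as an independent small string
    data = open(path, 'rb').read()
    hits = []
    for line in open(offsets_path):
        start, end = map(int, line.split())
        m = SECOND_PASS.search(data[start:end])
        if m:
            hits.append(start + m.start())  # offset into the original data
    return hits

Since every stored offset stays relative to the original data, this sidesteps the bookkeeping problem with slices.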
I'm not certain that regex is the best approach for this, but it seems fairly well suited. Essentially I'm currently parsing some PDFs using pdfminer, and the drawback is that these PDFs are exported PowerPoint slides, which means that all animations show up as fairly long repeated strings. Ideally I would like just one copy of each of these strings instead of a copy for each stage of an animation. Right now the regex pattern I'm using is this:
re.sub(r"([\w^\w]{10,})\1{1,}", "\1", string)
For some reason, though, this doesn't seem to change the input string. I feel like Python isn't recognizing the capture group, but I'm not sure how to remedy the issue. Any thoughts appreciated.
Examples:
I would like this
text to be
reduced
I would like this
text to be
reduced
output:
I would like this
text to be
reduced
Update:
To get this to pass the pumping lemma, I had to explicitly assert that all duplicates are adjacent. This was implied before, but I am now making it explicit to ensure that a solution is possible.
Regexps are not the right tool for that task. They are based on the theory of regular languages, and they can't check whether a string contains duplicates and remove them. You may find a course on automata and regexps an interesting read on the topic.
I think Josay's suggestion can be efficient and smart, but I think I have a simpler, more Pythonic solution, though it has its limits. You can split your string into a list of lines and pass it through a set():
>>> s = """I would like this
... text to be
...
... reduced
... I would like this
... text to be
...
... reduced"""
>>> print "\n".join(set(s.splitlines()))
I would like this
text to be
reduced
>>>
The only problem with that solution is that you lose the original order of the lines (the example above being a pretty good counterexample). Also, if you have the same line in two different contexts, you will end up with only one copy of it.
To fix the first problem, you may have to iterate over your original string a second time to put that set back in order, or simply use an ordered set; a sketch follows.
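For instance, the usual order-preserving variant uses the set only for membership tests (plain Python, no ordered-set dependency, with s as in the session above):

seen = set()
unique_lines = []
for line in s.splitlines():
    if line not in seen:           # keep only the first occurrence
        seen.add(line)
        unique_lines.append(line)
print "\n".join(unique_lines)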
If you have any symbol separating the slides, it will help you merge only the duplicates, fixing the second problem with that solution.
Otherwise a more sophisticated algorithm would be needed, so you can take proximity and context into account. For that, a suffix tree could be a good idea, and there are Python libraries for that (cf. that SO answer).
edit:
using your approach, I could make it work by adding the multiline flag and allowing spaces and newlines in the matched text:
>>> re.match(r"([\w \n]+)\n\1", string, re.MULTILINE).groups()
('I would like this\ntext to be\n\nreduced',)
Though, AFAICT, the \1 notation in the matching part is not regular expression syntax in the strict theoretical sense, but an extension. But it's getting late here, and I may well be totally wrong. Maybe I should reread those courses? :-)
I guess the regexp engine's pushdown automaton is able to push matches, because there is only one long multiline string that it can pop to match. Though I'd expect that to have side effects...
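For completeness, a hedged adaptation of that pattern back to re.sub for the original task. Two details matter: the replacement must be a raw string (a non-raw "\1" is the control character \x01, not a backreference), and the character class must admit spaces and newlines, which the question's [\w^\w] does not:

import re

text = ("I would like this\ntext to be\n\nreduced\n"
        "I would like this\ntext to be\n\nreduced")
# raw replacement string, class that includes spaces and newlines
deduped = re.sub(r"([\w \n]{10,})\n?\1", r"\1", text)
print(deduped)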
I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its metadata. Therefore my best bet is some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the metadata (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., with a sparse representation of the data)? Efficiency matters because my full corpus will be at least a few hundred megabytes.
Note:
I am serialising structured forum posts, which will include posts with quoted sections in them. I want to know which topic a word belonged to, and whether it is part of a quote or user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested metadata: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK. I haven't looked into whether it's possible to do what I want that way; if it is, please comment. But I am still looking for the original approach.
There is probably something in numpy that can solve my problem; I'm looking at that now.
edit
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals and not ranges :D I have used Boost extensively; however, making it a dependency makes me a bit uneasy. (Sadly, I am having compiler errors with PyICL :( )
The question now is: does anyone know of an interval container library or data structure that can be used to index nested intervals in a sparse fashion? Or, put differently, one that provides semantics similar to boost.icl?
If you don't want to use PyICL or boost.icl, you could just use sqlite3 to do the job instead of relying on a specialized library. If you use an in-memory database it will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures vs. sqlite3), but it should be more effective than a C++ std::vector-style approach on top of Python containers.
You can use two integers and a date_type_low < offset < date_type_high predicate in your WHERE clause. Depending on your table structure, this will return nested/overlapping ranges.
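A minimal sketch of that idea; the table and column names are made up, and the low <= offset < high predicate plays the role of the comparison above:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE intervals (low INTEGER, high INTEGER, meta TEXT)')
conn.execute('CREATE INDEX idx_intervals ON intervals (low, high)')

# a post spanning word offsets 0..100 with a quote nested at 10..40
conn.executemany('INSERT INTO intervals VALUES (?, ?, ?)', [
    (0, 100, 'post by user A'),
    (10, 40, 'quote of user B'),
])

def metadata_for(offset):
    # every interval containing the offset, nested ones included
    rows = conn.execute(
        'SELECT meta FROM intervals WHERE low <= ? AND ? < high',
        (offset, offset))
    return [r[0] for r in rows]

print(metadata_for(15))   # e.g. ['post by user A', 'quote of user B']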
I'm building an email filter and I need a way to efficiently match a single email to a large number of filters/rules. The email can be matched on any of the following fields:
From name
From address
Sender name
Sender address
Subject
Message body
Presently there are over 5000 filters (and growing) which are all defined in a single table in our PostgreSQL (9.1) database. Each filter may have 1 or more of the above fields populated with a Python regular expression.
The way filtering is currently done is to select all filters and load them into memory. We then iterate over them for each email until a positive match is found on all non-blank fields. Unfortunately this means that for any one email there can be as many as 30,000 (5000 x 6) re.match operations. Clearly this won't scale as more filters get added (actually, it already doesn't).
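For reference, the current approach amounts to something like this; the field names and the dict-based email/filter representation are illustrative, not our actual schema:

import re

FIELDS = ('from_name', 'from_address', 'sender_name',
          'sender_address', 'subject', 'body')

def filter_matches(email, filt):
    # a filter matches when every non-blank field pattern matches
    for field in FIELDS:
        pattern = filt.get(field)
        if pattern and not re.match(pattern, email[field]):
            return False
    return True

def first_match(email, filters):
    for filt in filters:            # 5000+ iterations per email
        if filter_matches(email, filt):
            return filt
    return None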
Is there a better way to do this?
Options I've considered so far:
Converting the saved Python regular expressions to POSIX-style ones to make use of PostgreSQL's SIMILAR TO expression. Will this really be any quicker? It seems to me that it simply shifts the load somewhere else.
Defining filters on a per-user basis. This isn't really practical, though, because in our system users actually benefit from a wealth of predefined filters.
Switching to a document-based search engine like Elasticsearch, where the first email to be filtered is saved as the canonical representation. By finding similar emails we could then narrow down to a specific feature set to test against and get a positive match.
Switching to a Bayesian filter, which would also give us some machine learning capability to detect similar emails, or changes to existing emails that would still match with high enough probability to guess they are the same thing. This sounds cool, but I'm not sure it would scale particularly well either.
Are there other options or approaches to consider?
The trigram support in PostgreSQL version 9.1 might give you what you want.
http://www.postgresql.org/docs/9.1/interactive/pgtrgm.html
It almost certainly will be a viable solution in 9.2 (scheduled for release in summer of 2012), since the new version knows how to use a trigram index for fast matching against regular expressions. At our shop we have found the speed of trigram indexes to be very good.
Also, if you ever want to do a "nearest neighbor" search, where you find the K best matches based on similarity to a search argument, a trigram index is wonderful -- it actually returns rows from the index scan in order of "distance". Search for KNN-GiST for write-ups.
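For example, a KNN-style query from Python might look like this; the emails table, its subject column, and the search string are assumptions, and <-> is pg_trgm's trigram distance operator, which needs a GiST index built with gist_trgm_ops to return rows in distance order:

import psycopg2

conn = psycopg2.connect('dbname=mail')
cur = conn.cursor()
# once: CREATE EXTENSION pg_trgm;
#       CREATE INDEX emails_subject_trgm ON emails
#           USING gist (subject gist_trgm_ops);
cur.execute("""
    SELECT subject, similarity(subject, %s) AS sml
    FROM emails
    ORDER BY subject <-> %s
    LIMIT 10
""", ('quarterly report', 'quarterly report'))
for row in cur.fetchall():
    print(row)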
How complex are these regexps? If they really are regular expressions (without all the crazy Python extensions), then you can combine them all into a single regexp (as alternatives) and use a simple (i.e. in-memory) regexp matcher.
I am not sure this will work, but I suspect you will be pleasantly surprised: the combined regexp compiles to a significantly smaller state machine, because common states are merged.
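A minimal sketch of the combining idea, with made-up filter ids and patterns; wrapping each alternative in a named group lets you see which filter fired:

import re

filters = {
    'f1': r'free\s+money',
    'f2': r'lottery\s+winner',
    'f3': r'account\s+suspended',
}
combined = re.compile('|'.join(
    '(?P<%s>%s)' % (name, pat) for name, pat in filters.items()
))

def first_matching_filter(text):
    m = combined.search(text)
    return m.lastgroup if m else None   # name of the group that matched

print(first_matching_filter('You are a lottery winner!'))   # 'f2'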
Also, for a fast regular expression engine, consider nrgrep, which does fast scanning. This should give you a speedup when jumping from one header to the next (I haven't used it myself, but they're friends of friends and the paper looked pretty neat when I read it).