python rply reverse parser

I'm using rply and Python 3.6 to create a lexer and a parser for a little private project.
But what I noticed is that the parser appears to flip the order of the lexer's token stream.
This is the file I'm parsing:
let test:string = "test";
print(test);
Lexer output:
Token('LET', 'let')
Token('NAME', 'test')
Token('COLON', ':')
Token('NAME', 'string')
Token('EQUALS', '=')
Token('STRING', '"test"')
Token('SEMI_COLON', ';')
Token('PRINT', 'print')
Token('OPEN_PARENS', '(')
Token('STRING', '"test"')
Token('CLOSE_PARENS', ')')
Token('SEMI_COLON', ';')
As you can see, it is in the order of the script.
I use the parser to create a variable with name test, type string and value "test". Then I want to print the variable.
It does create the variable but when I want to print it out, there is nothing.
But when I flip the script like this
print(test);
let test:string = "test";
it is able to print the value correctly.
The two parser 'rules' look like this:
Print:
@self.pg.production('expression : PRINT OPEN_PARENS expression CLOSE_PARENS SEMI_COLON expression')
def print_s(p):
    ...
Create variable:
@self.pg.production('expression : LET expression COLON expression EQUALS expression SEMI_COLON expression')
def create_var(p):
    ...
So my question is: How can I flip the order in which the content is parsed?
Edit: I looked for similar questions or problems and also in the documentation but didn't find anything.

Here's a somewhat simpler example; hopefully, you can see the pattern.
The key insight is that reduction actions (that is, the parser functions) are executed when the production's match has been fully parsed. That means that if a production contains non-terminals, the actions for those non-terminals are executed before the action for the whole production.
It should be clear why this is true. Every production action depends on the semantic values of all of the components, and in the case of non-terminals those values are produced by running the corresponding actions.
Now, consider these two very similar ways to parse a list of things. In both cases, we assume there is a base production which recognises an empty list (list :) and does nothing.
Right recursion:
list : thing list
Left recursion:
list : list thing
In both cases, the action prints the thing, which is p[0] in the right-recursive case, and p[1] in the left-recursive one.
The right-recursive production will cause the things to be printed in reverse order, because printing the thing doesn't happen until after the internal list is parsed (and its components are printed).
But the left-recursive production will print the things in left-to-right order, for the same reason. The difference is that in the left-recursive case, the internal (recursive) list contains the initial things, while in the right-recursive case, the list contains the final things.
If you were just building a Python list of things, this probably wouldn't matter much, since execution order wouldn't be important. It's only visible in this example because the action has a side-effect (printing a value), which makes execution order visible.
There are other techniques to order actions, in the rare cases where it is really necessary. But best practice is to always use left-recursion whenever it is syntactically practical. Left-recursive parsers are more efficient because the parser doesn't need to accumulate a stack of incomplete productions. And left-recursion is often better for your actions as well.
Here, for example, the left-recursive action could append the new value (p[0].append(p[1]); return p[0]), while the right-recursive action needs to create a new list (return [p[0]] + p[1]). Since repeated appending is on average linear time while repeated concatenation is quadratic, the left-recursive parser is more scalable for large lists.
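For concreteness, here is a minimal rply sketch of the left-recursive version. The token and rule names are invented for this example, and the base case is a single THING rather than an empty list, just to keep the sketch short:
from rply import LexerGenerator, ParserGenerator

lg = LexerGenerator()
lg.add('THING', r'\w+')
lg.ignore(r'\s+')
lexer = lg.build()

pg = ParserGenerator(['THING'])

@pg.production('list : THING')
def list_first(p):
    print(p[0].getstr())        # runs first, for the leftmost thing
    return [p[0].getstr()]

@pg.production('list : list THING')
def list_more(p):
    print(p[1].getstr())        # runs left to right as each thing is reduced
    p[0].append(p[1].getstr())
    return p[0]

parser = pg.build()
parser.parse(lexer.lex('alpha beta gamma'))   # prints alpha, beta, gamma in that order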

Related

Nested list comprehension against .NET Arrays within Arrays

I have a .NET structure that has arrays with arrays in it. I want to create a list of members of items from a specific array in a specific array using list comprehension in IronPython, if possible.
Here is what I am doing now:
tag_results = [item_result for item_result in results.ItemResults if item_result.ItemId == tag_id][0]
tag_vqts = [vqt for vqt in tag_results.VQTs]
tag_timestamps = [vqt.TimeStamp for vqt in tag_vqts]
So, get the single item result from the results array which matches my condition, then get the vqts arrays from those item results, THEN get all the timestamp members for each VQT in the vqts array.
Is wanting to do this in a single statement overkill? Later on, the timestamps are used in this manner:
vqts_to_write = [vqt for vqt in resampled_vqts if not vqt.TimeStamp in tag_timestamps]
I am not sure if a generator would be appropriate, since I am not really looping through them, I just want a list of all the timestamps for all the item results for this item/tag so that I can test membership in the list.
I have to do this multiple times for different contexts in my script, so I was just wondering if I am doing this in an efficient and pythonic manner. I am refactoring this into a method, which got me thinking about making it easier.
FYI, this is IronPython 2.6, embedded in a fixed environment that does not allow the use of numpy, pandas, etc. It is safe to assume I need a python 2.6 only solution.
My main question is:
Would collapsing this into a single line, if possible, obfuscate the code?
If collapsing is appropriate, would a method be overkill?
Two! My two main questions are:
Would collapsing this into a single line, if possible, obfuscate the code?
If collapsing is appropriate, would a method be overkill?
Is a generator appropriate for testing membership in a list?
Three! My three questions are... Amongst my questions are such diverse queries as...I'll come in again...
(it IS python...)
tag_results = [...][0] builds a whole new list just to get one item. This is what next() on a generator expression is for:
next(item_result for item_result in results.ItemResults if item_result.ItemId == tag_id)
which only iterates just enough to get a first item.
You can inline that, but I'd keep that as a separate expression for readability.
The remainder is easily put into one expression:
tag_results = next(item_result for item_result in results.ItemResults
                   if item_result.ItemId == tag_id)
tag_timestamps = [vqt.TimeStamp for vqt in tag_results.VQTs]
I'd make that a set if you only need to do membership testing:
tag_timestamps = set(vqt.TimeStamp for vqt in tag_results.VQTs)
Sets allow for constant time membership tests; testing against a list takes linear time as the whole list could end up being scanned for each such test.
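With the timestamps in a set, the membership filter from the question is unchanged, just spelled with not in:
tag_timestamps = set(vqt.TimeStamp for vqt in tag_results.VQTs)
vqts_to_write = [vqt for vqt in resampled_vqts if vqt.TimeStamp not in tag_timestamps]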

Does Pyparsing Support Context-Sensitive Grammars?

Forgive me if I have the incorrect terminology; perhaps just getting the "right" words to describe what I want is enough for me to find the answer on my own.
I am working on a parser for ODL (Object Description Language), an arcane language that as far as I can tell is now used only by NASA PDS (Planetary Data Systems; it's how NASA makes its data available to the public). Fortunately, PDS is finally moving to XML, but I still have to write software for a mission that fell just before the cutoff.
ODL defines objects in something like the following manner:
OBJECT = TABLE
ROWS = 128
ROW_BYTES = 512
END_OBJECT = TABLE
I am attempting to write a parser with pyparsing, and I was doing fine right up until I came to the above construction.
I have to create some rule that is able to ensure that the right-hand-value of the OBJECT line is identical to the RHV of END_OBJECT. But I can't seem to put that into a pyparsing rule. I can ensure that both are syntactically valid values, but I can't go the extra step and ensure that the values are identical.
Am I correct in my intuition that this is a context-sensitive grammar? Is that the phrase I should be using to describe this problem?
Whatever kind of grammar this is in the theoretical sense, is pyparsing able to handle this kind of construction?
If pyparsing is not able to handle it, is there another Python tool capable of doing so? How about ply (the Python implementation of lex/yacc)?
It is in fact a grammar for a context-sensitive language, classically abstracted as wcw where w is in (a|b)* (note that wcw' , where ' indicates reversal, is context-free).
Parsing Expression Grammars are capable of parsing wcw-type languages by using semantic predicates. PyParsing provides the matchPreviousExpr() and matchPreviousLiteral() helper methods for this very purpose, e.g.
w = Word("ab")
s = w + "c" + matchPreviousExpr(w)
So in your case you'd probably do something like
table_name = Word(alphas, alphanums)
object = (Literal("OBJECT") + "=" + table_name + ... +
          Literal("END_OBJECT") + "=" + matchPreviousExpr(table_name))
As a general rule, parsers are built as context-free parsing engines. If there is context sensitivity, it is grafted on after parsing (or at least after the relevant parsing steps are completed).
In your case, you want to write context-free grammar rules:
head = 'OBJECT' '=' IDENTIFIER ;
tail = 'END_OBJECT' '=' IDENTIFIER ;
element = IDENTIFIER '=' value ;
element_list = element ;
element_list = element_list element ;
block = head element_list tail ;
The check that the head and tail constructs have matching identifiers isn't technically done by the parser.
Many parsers, however, allow a semantic action to occur when a syntactic element is recognized, often for the purpose of building tree nodes. In your case, you want to use this to enable additional checking. For element, you want to make sure the IDENTIFIER isn't a duplicate of something already in the block; this means that for each element encountered, you'll want to capture the corresponding IDENTIFIER and build a block-specific list to enable duplicate checking. For block, you want to capture the head IDENTIFIER and check that it matches the tail IDENTIFIER.
This is easiest if you build a tree representing the parse as you go along and hang the various context-sensitive values on the tree in various places (e.g., attach the actual IDENTIFIER value to the tree node for the head clause). At the point where you are building the tree node for the tail construct, it should be straightforward to walk up the tree, find the head node, and then compare the identifiers.
This is easier to think about if you imagine the entire tree being built first, and then a post-processing pass over the tree is used to do this checking. Lazy people in fact do it this way :-} All we are doing is pushing work that could be done in the post-processing step into the tree-building steps attached to the semantic actions.
None of these concepts is Python specific, and the details for PyParsing will vary somewhat.
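As a rough illustration in ply (the tokens and the tiny grammar below are invented for this sketch, not a real ODL parser), the identifier comparison lives in the reduction action for block rather than in the grammar itself:
import ply.lex as lex
import ply.yacc as yacc

reserved = {'OBJECT': 'OBJECT', 'END_OBJECT': 'END_OBJECT'}
tokens = ['IDENTIFIER', 'NUMBER', 'EQUALS'] + list(reserved.values())

t_EQUALS = r'='
t_NUMBER = r'[0-9]+'
t_ignore = ' \t\r\n'

def t_IDENTIFIER(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER')   # keywords are reserved identifiers
    return t

def t_error(t):
    t.lexer.skip(1)

def p_block(p):
    'block : head element_list tail'
    if p[1] != p[3]:                               # the context-sensitive check
        raise SyntaxError("END_OBJECT %r does not match OBJECT %r" % (p[3], p[1]))
    p[0] = (p[1], p[2])

def p_head(p):
    'head : OBJECT EQUALS IDENTIFIER'
    p[0] = p[3]

def p_tail(p):
    'tail : END_OBJECT EQUALS IDENTIFIER'
    p[0] = p[3]

def p_element_list(p):
    '''element_list : element
                    | element_list element'''
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[2]]

def p_element(p):
    'element : IDENTIFIER EQUALS value'
    p[0] = (p[1], p[3])

def p_value(p):
    '''value : NUMBER
             | IDENTIFIER'''
    p[0] = p[1]

def p_error(p):
    raise SyntaxError("parse error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('OBJECT = TABLE ROWS = 128 ROW_BYTES = 512 END_OBJECT = TABLE'))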

Python lxml - How to remove empty repeated tags

I have some XML that is generated by a script that may or may not have empty elements. I was told that now we cannot have empty elements in the XML. Here is an example:
<customer>
<govId>
<id>#</id>
<idType>SSN</idType>
<issueDate/>
<expireDate/>
<dob/>
<state/>
<county/>
<country/>
</govId>
<govId>
<id/>
<idType/>
<issueDate/>
<expireDate/>
<dob/>
<state/>
<county/>
<country/>
</govId>
</customer>
The output should look like this:
<customer>
<govId>
<id>#</id>
<idType>SSN</idType>
</govId>
</customer>
I need to remove all the empty elements. You'll note that my code took out the empty stuff in the first "govId" element, but didn't take out anything in the second. I am using lxml.objectify at the moment.
Here is basically what I am doing:
root = objectify.fromstring(xml)
for customer in root.customers.iterchildren():
    for e in customer.govId.iterchildren():
        if not e.text:
            customer.govId.remove(e)
Does anyone know of a way to do this with lxml objectify or is there an easier way period? I would also like to remove the second "govId" element in its entirety if all its elements are empty.
First of all, the problem with your code is that you are iterating over customers, but not over govIds. On the third line you take the first govId for every customer, and iterate over its children. So, you'd need another for loop for the code to work like you intended it to.
This small sentence at the end of your question then makes the problem quite a bit more complex: I would also like to remove the second "govId" element in its entirety if all its elements are empty.
This means, unless you want to hard-code just checking one level of nesting, you need to recursively check whether an element and its children are empty. Like this, for example:
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
Note: Python 2.5+ because of the use of the all() builtin.
You then can change your code to something like this to remove all the elements in the document that are empty all the way down.
# Walk over all elements in the tree and remove all
# nodes that are recursively empty
context = etree.iterwalk(root)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
Sample output:
<customer>
<govId>
<id>#</id>
<idType>SSN</idType>
</govId>
</customer>
One thing you might want to do is refine the condition if e.text: in the recursive function. Currently this will consider None and the empty string as empty, but not whitespace like spaces and newlines. Use str.strip() if that's part of your definition of "empty".
Edit: As pointed out by @Dave, the recursive function could be improved by using a generator expression:
return all((recursively_empty(c) for c in e.getchildren()))
This will not evaluate recursively_empty(c) for all the children at once, but evaluate it for each one lazily. Since all() will stop iteration upon the first False element, this could mean a significant performance improvement.
Edit 2: The expression can be further optimized by using e.iterchildren() instead of e.getchildren(). This works with the lxml etree API and the objectify API.
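Putting those refinements together, one self-contained sketch might look like this (it collects the doomed nodes first instead of removing them mid-walk, and treats whitespace-only text as empty):
from lxml import etree

def recursively_empty(e):
    # Whitespace-only text counts as empty here
    if e.text and e.text.strip():
        return False
    return all(recursively_empty(c) for c in e.iterchildren())

root = etree.fromstring(
    "<customer>"
    "<govId><id>#</id><idType>SSN</idType><state/><country/></govId>"
    "<govId><id/><idType/><state/><country/></govId>"
    "</customer>")

doomed = [el for _, el in etree.iterwalk(root)
          if el.getparent() is not None and recursively_empty(el)]
for el in doomed:
    el.getparent().remove(el)

print(etree.tostring(root, pretty_print=True).decode())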

Difference between two "contains" operations for python lists

I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.
In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print "Found it"
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print "Found it"
the "if x in thing:" format is strongly preferred, not just because it takes less code, but it also works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on datatypes that are stored in a more searchable form. eg. sets or dictionary keys.
The if thing in somelist is the preferred and fastest way.
Under-the-hood that use of the in-operator translates to somelist.__contains__(thing) whose implementation is equivalent to: any((x is thing or x == thing) for x in somelist).
Note the condition tests identity and then equality.
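That identity-then-equality order is occasionally visible; for example, a NaN is found in a list holding that same object even though NaN never compares equal to itself:
>>> nan = float('nan')
>>> nan == nan
False
>>> nan in [nan]   # found via the identity check
True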
for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python-level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary, thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.

Hashtable/dictionary/map lookup with regular expressions

I'm trying to figure out if there's a reasonably efficient way to perform a lookup in a dictionary (or a hash, or a map, or whatever your favorite language calls it) where the keys are regular expressions and strings are looked up against the set of keys. For example (in Python syntax):
>>> regex_dict = { re.compile(r'foo.') : 12, re.compile(r'^FileN.*$') : 35 }
>>> regex_dict['food']
12
>>> regex_dict['foot in my mouth']
12
>>> regex_dict['FileNotFoundException: file.x does not exist']
35
(Obviously the above example won't work as written in Python, but that's the sort of thing I'd like to be able to do.)
I can think of a naive way to implement this, in which I iterate over all of the keys in the dictionary and try to match the passed in string against them, but then I lose the O(1) lookup time of a hash map and instead have O(n), where n is the number of keys in my dictionary. This is potentially a big deal, as I expect this dictionary to grow very large, and I will need to search it over and over again (actually I'll need to iterate over it for every line I read in a text file, and the files can be hundreds of megabytes in size).
Is there a way to accomplish this, without resorting to O(n) efficiency?
Alternatively, if you know of a way to accomplish this sort of a lookup in a database, that would be great, too.
(Any programming language is fine -- I'm using Python, but I'm more interested in the data structures and algorithms here.)
Someone pointed out that more than one match is possible, and that's absolutely correct. Ideally in this situation I'd like to return a list or tuple containing all of the matches. I'd settle for the first match, though.
I can't see O(1) being possible in that scenario; I'd settle for anything less than O(n), though. Also, the underlying data structure could be anything, but the basic behavior I'd like is what I've written above: lookup a string, and return the value(s) that match the regular expression keys.
What you want to do is very similar to what is supported by xrdb. They only support a fairly minimal notion of globbing however.
Internally you can implement a larger family of regular languages than theirs by storing your regular expressions as a character trie.
single characters just become trie nodes.
.'s become wildcard insertions covering all children of the current trie node.
*'s become back links in the trie to the node at the start of the previous item.
[a-z] ranges insert the same subsequent child nodes repeatedly under each of the characters in the range. With care, while inserts/updates may be somewhat expensive, the search can be linear in the size of the string. With some placeholder machinery the common combinatorial-explosion cases can be kept under control.
(foo)|(bar) nodes become multiple insertions
This doesn't handle regexes that occur at arbitrary points in the string, but that can be modeled by wrapping your regex with .* on either side.
Perl has a couple of Text::Trie -like modules you can raid for ideas. (Heck I think I even wrote one of them way back when)
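A toy sketch of the trie idea, handling only literal characters and the '.' wildcard (no '*' or alternation; the class and method names are made up):
class RegexTrie(object):
    def __init__(self):
        self.root = {}

    def add(self, pattern, value):
        node = self.root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = value                 # terminal marker

    def lookup(self, s):
        # Return the values of all patterns that match s exactly
        nodes = [self.root]
        for ch in s:
            next_nodes = []
            for node in nodes:
                if ch in node:
                    next_nodes.append(node[ch])
                if '.' in node:            # wildcard edge matches any character
                    next_nodes.append(node['.'])
            nodes = next_nodes
        return [node[None] for node in nodes if None in node]

trie = RegexTrie()
trie.add('foo.', 12)
trie.add('food', 99)
print(trie.lookup('food'))   # [99, 12] -- both patterns match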
This is not possible to do with a regular hash table in any language. You'll either have to iterate through the entire keyset, attempting to match the key to your regex, or use a different data structure.
You should choose a data structure that is appropriate to the problem you're trying to solve. If you have to match against any arbitrary regular expression, I don't know of a good solution. If the class of regular expressions you'll be using is more restrictive, you might be able to use a data structure such as a trie or suffix tree.
In the general case, what you need is a lexer generator. It takes a bunch of regular expressions and compiles them into a recognizer. "lex" will work if you are using C. I have never used a lexer generator in Python, but there seem to be a few to choose from. Google shows PLY, PyGgy and PyLexer.
If the regular expressions all resemble each other in some way, then you may be able to take some shortcuts. We would need to know more about the ultimate problem that you are trying to solve in order to come up with any suggestions. Can you share some sample regular expressions and some sample data?
Also, how many regular expressions are you dealing with here? Are you sure that the naive approach won't work? As Rob Pike once said, "Fancy algorithms are slow when n is small, and n is usually small." Unless you have thousands of regular expressions, and thousands of things to match against them, and this is an interactive application where a user is waiting for you, you may be best off just doing it the easy way and looping through the regular expressions.
This is definitely possible, as long as you're using 'real' regular expressions. A textbook regular expression is something that can be recognized by a deterministic finite state machine, which primarily means you can't have back-references in there.
There's a property of regular languages that "the union of two regular languages is regular", meaning that you can recognize an arbitrary number of regular expressions at once with a single state machine. The state machine runs in O(1) time with respect to the number of expressions (it runs in O(n) time with respect to the length of the input string, but hash tables do too).
Once the state machine completes you'll know which expressions matched, and from there it's easy to look up values in O(1) time.
What happens if you have a dictionary such as
regex_dict = { re.compile("foo.*"): 5, re.compile("f.*"): 6 }
In this case regex_dict["food"] could legitimately return either 5 or 6.
Even ignoring that problem, there's probably no way to do this efficiently with the regex module. Instead, what you'd need is an internal directed graph or tree structure.
What about the following:
import re

class redict(dict):
    def __init__(self, d):
        dict.__init__(self, d)

    def __getitem__(self, regex):
        r = re.compile(regex)
        mkeys = filter(r.match, self.keys())
        for i in mkeys:
            yield dict.__getitem__(self, i)
It's basically a subclass of the dict type in Python. With this you can supply a regular expression as a key, and the values of all keys that match this regex are returned in an iterable fashion using yield.
With this you can do the following:
>>> keys = ["a", "b", "c", "ab", "ce", "de"]
>>> vals = range(0,len(keys))
>>> red = redict(zip(keys, vals))
>>> for i in red[r"^.e$"]:
... print i
...
5
4
>>>
Here's an efficient way to do it by combining the keys into a single compiled regexp, and so not requiring any looping over key patterns. It abuses the lastindex to find out which key matched. (It's a shame regexp libraries don't let you tag the terminal state of the DFA that a regexp is compiled to, or this would be less of a hack.)
The expression is compiled once, and will produce a fast matcher that doesn't have to search sequentially. Common prefixes are compiled together in the DFA, so each character in the key is matched once, not many times, unlike some of the other suggested solutions. You're effectively compiling a mini lexer for your keyspace.
This map isn't extensible (can't define new keys) without recompiling the regexp, but it can be handy for some situations.
# Regular expression map
# Abuses match.lastindex to figure out which key was matched
# (i.e. to emulate extracting the terminal state of the DFA of the regexp engine)
# Mostly for amusement.
# Richard Brooksby, Ravenbrook Limited, 2013-06-01

import re

class ReMap(object):

    def __init__(self, items):
        if not items:
            items = [(r'epsilon^', None)]  # Match nothing
        key_patterns = []
        self.lookup = {}
        index = 1
        for key, value in items:
            # Ensure there are no capturing parens in the key, because
            # that would mess up match.lastindex
            key_patterns.append('(' + re.sub(r'\((?!\?:)', '(?:', key) + ')')
            self.lookup[index] = value
            index += 1
        self.keys_re = re.compile('|'.join(key_patterns))

    def __getitem__(self, key):
        m = self.keys_re.match(key)
        if m:
            return self.lookup[m.lastindex]
        raise KeyError(key)

if __name__ == '__main__':
    remap = ReMap([(r'foo.', 12), (r'FileN.*', 35)])
    print remap['food']
    print remap['foot in my mouth']
    print remap['FileNotFoundException: file.x does not exist']
There is a Perl module that does just this: Tie::Hash::Regex.
use Tie::Hash::Regex;
my %h;
tie %h, 'Tie::Hash::Regex';
$h{key} = 'value';
$h{key2} = 'another value';
$h{stuff} = 'something else';
print $h{key}; # prints 'value'
print $h{2}; # prints 'another value'
print $h{'^s'}; # prints 'something else'
print tied(%h)->FETCH(k); # prints 'value' and 'another value'
delete $h{k}; # deletes $h{key} and $h{key2};
@rptb1 you don't have to avoid capturing groups, because you can use the compiled pattern's groups attribute to count them. Like this:
# Regular expression map
# Abuses match.lastindex to figure out which key was matched
# (i.e. to emulate extracting the terminal state of the DFA of the regexp engine)
# Mostly for amusement.
# Richard Brooksby, Ravenbrook Limited, 2013-06-01

import re

class ReMap(object):

    def __init__(self, items):
        if not items:
            items = [(r'epsilon^', None)]  # Match nothing
        self.re = re.compile('|'.join('(' + k + ')' for (k, v) in items))
        self.lookup = {}
        index = 1
        for key, value in items:
            self.lookup[index] = value
            index += re.compile(key).groups + 1

    def __getitem__(self, key):
        m = self.re.match(key)
        if m:
            return self.lookup[m.lastindex]
        raise KeyError(key)

def test():
    remap = ReMap([(r'foo.', 12),
                   (r'.*([0-9]+)', 99),
                   (r'FileN.*', 35),
                   ])
    print remap['food']
    print remap['foot in my mouth']
    print remap['FileNotFoundException: file.x does not exist']
    print remap['there were 99 trombones']
    print remap['food costs $18']
    print remap['bar']

if __name__ == '__main__':
    test()
Sadly very few RE engines actually compile the regexps down to machine code, although it's not especially hard to do. I suspect there's an order of magnitude performance improvement waiting for someone to make a really good RE JIT library.
As other respondents have pointed out, it's not possible to do this with a hash table in constant time.
One approximation that might help is to use a technique called "n-grams". Create an inverted index from n-character chunks of a word to the entire word. When given a pattern, split it into n-character chunks, and use the index to compute a scored list of matching words.
Even if you can't accept an approximation, in most cases this would still provide an accurate filtering mechanism so that you don't have to apply the regex to every key.
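One possible reading of that idea in code (indexing plain words by character trigrams and scoring candidates for a literal fragment pulled out of a pattern; the names are invented):
from collections import defaultdict

def ngrams(s, n=3):
    return set(s[i:i + n] for i in range(len(s) - n + 1))

class NgramIndex(object):
    def __init__(self, words, n=3):
        self.n = n
        self.index = defaultdict(set)
        for w in words:
            for g in ngrams(w, n):
                self.index[g].add(w)

    def candidates(self, fragment):
        # Words sharing n-grams with the fragment, best-scoring first
        scores = defaultdict(int)
        for g in ngrams(fragment, self.n):
            for w in self.index[g]:
                scores[w] += 1
        return sorted(scores, key=scores.get, reverse=True)

idx = NgramIndex(['FileNotFound', 'FileExists', 'foobar'])
print(idx.candidates('FileN'))   # ['FileNotFound', 'FileExists'] -- prefilter, then run the real regexes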
A special case of this problem came up in the 70s AI languages oriented around deductive databases. The keys in these databases could be patterns with variables -- like regular expressions without the * or | operators. They tended to use fancy extensions of trie structures for indexes. See krep*.lisp in Norvig's Paradigms of AI Programming for the general idea.
If you have a small set of possible inputs, you can cache the matches as they appear in a second dict and get O(1) for the cached values.
If the set of possible inputs is too big to cache, but not infinite either, you can just keep the last N matches in the cache (check Google for "LRU maps" - least recently used).
If you can't do this, you can try to chop down the number of regexps you have to try by checking a prefix or somesuch.
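A minimal sketch of the caching idea (a hypothetical wrapper; lookups still fall back to the naive scan, but repeated inputs become plain dict hits):
import re

class CachedRegexDict(object):
    def __init__(self, regex_items):
        self._items = [(re.compile(p), v) for p, v in regex_items]
        self._cache = {}                      # input string -> first matching value

    def __getitem__(self, s):
        if s in self._cache:
            return self._cache[s]
        for rx, value in self._items:         # naive O(n) scan on a cache miss
            if rx.match(s):
                self._cache[s] = value
                return value
        raise KeyError(s)

d = CachedRegexDict([(r'foo.', 12), (r'^FileN.*$', 35)])
print(d['food'], d['food'])   # the second lookup is served from the cache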
I created this exact data structure for a project once. I implemented it naively, as you suggested. I did make two immensely helpful optimizations, which may or may not be feasible for you, depending on the size of your data:
Memoizing the hash lookups
Pre-seeding the memoization table (not sure what to call this... warming up the cache?)
To avoid the problem of multiple keys matching the input, I gave each regex key a priority and the highest priority was used.
The fundamental assumption is flawed, I think: you can't map hashes to regular expressions.
I don't think it's even theoretically possible. What happens if someone passes in a string that matches more than one regular expression?
For example, what would happen if someone did:
>>> regex_dict['FileNfoo']
How can something like that possibly be O(1)?
It may be possible to get the regex compiler to do most of the work for you by concatenating the search expressions into one big regexp, separated by "|". A clever regex compiler might search for commonalities in the alternatives in such a case, and devise a more efficient search strategy than simply checking each one in turn. But I have no idea whether there are compilers which will do that.
It really depends on what these regexes look like. If you don't have many regexes that match almost anything, like '.*' or '\d+', and instead have regexes that mostly contain words and phrases or fixed patterns longer than 4 characters (e.g. 'a*b*c' in ^\d+a\*b\*c:\s+\w+), as in your examples, then you can use this common trick, which scales well to millions of regexes:
Build an inverted index for the regexes (rabin-karp-hash('fixed pattern') -> list of regexes containing 'fixed pattern'). Then at matching time, use Rabin-Karp hashing to compute sliding hashes and look up the inverted index, advancing one character at a time. You now have O(1) look-up for inverted-index non-matches and a reasonable O(k) time for matches, where k is the average length of the regex lists in the inverted index. k can be quite small (less than 10) for many applications. The quality of the inverted index (a false positive means a bigger k; a false negative means a missed match) depends on how well the indexer understands the regex syntax. If the regexes are written by human experts, they can also provide hints about the fixed patterns they contain.
OK, I have a very similar requirement. I have many lines of different syntax, basically remark lines and lines with codes used in a smart-card formatting process, as well as descriptor lines for keys and secret codes. In every case, I think the pattern/action model is the best approach for recognizing and processing many lines.
I'm using C++/CLI to develop my assembly, named LanguageProcessor.dll. The core of this library is a lex_rule class that basically contains:
a Regex member
an event member
The constructor loads the regex string and calls the necessary code to build the event on the fly using DynamicMethod, Emit and Reflection. The assembly also contains other classes, like meta and object, that construct and instantiate objects from the simple names of the publisher and receiver classes; the receiver class provides the action handlers for each matched rule.
Lately, I have a class named fasterlex_engine that builds a Dictionary<Regex, action_delegate> and loads the definitions from an array to run.
The project is at an advanced point, but I'm still building it today. I will try to enhance the running performance around the sequential access to every pair for each input line, by using some mechanism of looking up the dictionary directly by the regexp, like:
map_rule[gcnew Regex("[a-zA-Z]")];
Here are some segments of my code:
public ref class lex_rule: ILexRule
{
private:
    Exception ^m_exception;
    Regex ^m_pattern;

    // BACKSTORAGE delegates; I picked this up digging through the lousy .NET internals, heh
    yy_lexical_action ^m_yy_lexical_action;
    yy_user_action ^m_yy_user_action;

public:
    virtual property String ^short_id;

private:
    void init(String ^_short_id, String ^well_formed_regex);

public:
    lex_rule();
    lex_rule(String ^_short_id, String ^well_formed_regex);

    virtual event yy_lexical_action ^YY_RULE_MATCHED
    {
        virtual void add(yy_lexical_action ^_delegateHandle)
        {
            if(nullptr==m_yy_lexical_action)
                m_yy_lexical_action=_delegateHandle;
        }
        virtual void remove(yy_lexical_action ^)
        {
            m_yy_lexical_action=nullptr;
        }

        virtual long raise(String ^id_rule, String ^input_string, String ^match_string, int index)
        {
            long lReturn=-1L;
            if(m_yy_lexical_action)
                lReturn=m_yy_lexical_action(id_rule,input_string, match_string, index);
            return lReturn;
        }
    }
};
Now the fasterlex_engine class that executes a lot of pattern/action pairs:
public ref class fasterlex_engine
{
private:
    Dictionary<String^,ILexRule^> ^m_map_rules;

public:
    fasterlex_engine();
    fasterlex_engine(array<String ^,2>^defs);
    Dictionary<String ^,Exception ^> ^load_definitions(array<String ^,2> ^defs);
    void run();
};
And to round out this topic, some code from my .cpp file.
This code creates a constructor invoker by parameter signature:
inline Exception ^object::builder(ConstructorInfo ^target, array<Type^> ^args)
{
    try
    {
        DynamicMethod ^dm=gcnew DynamicMethod(
            "dyna_method_by_totem_motorist",
            Object::typeid,
            args,
            target->DeclaringType);
        ILGenerator ^il=dm->GetILGenerator();
        il->Emit(OpCodes::Ldarg_0);
        il->Emit(OpCodes::Call,Object::typeid->GetConstructor(Type::EmptyTypes)); // invokes the base constructor
        il->Emit(OpCodes::Ldarg_0);
        il->Emit(OpCodes::Ldarg_1);
        il->Emit(OpCodes::Newobj, target); // Newobj creates the object and invokes the constructor defined in target
        il->Emit(OpCodes::Ret);
        method_handler=(method_invoker ^) dm->CreateDelegate(method_invoker::typeid);
    }
    catch (Exception ^e)
    {
        return e;
    }
    return nullptr;
}
This code attaches any handler function (static or not) to deal with a callback raised by the matching of an input string:
Delegate ^connection_point::hook(String ^receiver_namespace,String ^receiver_class_name, String ^handler_name)
{
    Delegate ^d=nullptr;
    if(connection_point::waitfor_hook<=m_state) // if the state is 0, 1, 2 or more => try to hook
    {
        try
        {
            Type ^tmp=meta::_class(receiver_namespace+"."+receiver_class_name);
            m_handler=tmp->GetMethod(handler_name);
            m_receiver_object=Activator::CreateInstance(tmp,false);

            d=m_handler->IsStatic?
                Delegate::CreateDelegate(m_tdelegate,m_handler):
                Delegate::CreateDelegate(m_tdelegate,m_receiver_object,m_handler);

            m_add_handler=m_connection_point->GetAddMethod();
            array<Object^> ^add_handler_args={d};
            m_add_handler->Invoke(m_publisher_object, add_handler_args);
            ++m_state;
            m_exception_flag=false;
        }
        catch(Exception ^e)
        {
            m_exception_flag=true;
            throw gcnew Exception(e->ToString()) ;
        }
    }
    return d;
}
Finally, the code that calls the lexer engine:
array<String ^,2> ^defs=gcnew array<String^,2> { /* shortID   pattern       namespc   class        func */
    {"LETRAS", "[A-Za-z]+" ,"prueba", "manejador", "procesa_directriz"},
    {"INTS",   "[0-9]+"    ,"prueba", "manejador", "procesa_comentario"},
    {"REM",    "--[^\\n]*" ,"prueba", "manejador", "nullptr"}
}; //[3,5]
// I use the special identifier "nullptr" so the system assigns the event processing to a default handler that does nothing
fasterlex_engine ^lex=gcnew fasterlex_engine();
Dictionary<String ^,Exception ^> ^map_error_list=lex->load_definitions(defs);
lex->run();
The problem has nothing to do with regular expressions - you'd have the same problem with a dictionary whose keys are lambdas. The problem you face is figuring out whether there is a way of classifying your functions to determine which will return true, and that isn't a search problem because f(x) is not known in general beforehand.
Distributed programming or caching answer sets, assuming there are common values of x, may help.
-- DM
