Is there any practical difference in power between a 'regular expression' as exemplified in NLTK's documentation and a CFG from the same? There definitely should be, since there are context-free languages that are not regular, but I can't find a concrete example where the CFG approach outshines a regular expression.
http://nltk.org/book/ch07.html
From the documentation of RegexpParser:
The patterns of a clause are executed in order. An earlier
pattern may introduce a chunk boundary that prevents a later
pattern from executing. Sometimes an individual pattern will
match on multiple, overlapping extents of the input. As with
regular expression substitution more generally, the chunker will
identify the first match possible, then continue looking for matches
after this one has ended.
The clauses of a grammar are also executed in order. A cascaded
chunk parser is one having more than one clause. The maximum depth
of a parse tree created by this chunk parser is the same as the
number of clauses in the grammar.
That is, each clause/pattern is executed once. Thus you'll run into trouble as soon as you need the output of a later clause to be matched by an earlier one.
A practical example is the way something that could be a complete sentence on its own can be used as a clause in a larger sentence:
The cat purred.
He heard that the cat purred.
She saw that he heard that the cat purred.
As we can read from the documentation above, when you construct a RegexpParser you're setting an arbitrary limit for the "depth" of this sort of sentence. There is no "recursion limit" for context-free grammars.
The documentation mentions that you can use looping to mitigate this somewhat -- if you run through a suitable grammar two or three or four times, you can get a deeper parse. You can add external logic to loop your grammar many times, or until nothing more can be parsed.
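For instance, the cascaded chunker from the chapter linked above only finds the outer clause if it gets a second pass over its own output; the loop argument to RegexpParser controls how many passes are made. A sketch adapted from that chapter:

import nltk

# Adapted from the NLTK book's cascaded-chunker example: the CLAUSE found on
# the first pass only becomes available to the VP pattern on a second pass,
# so loop=2 produces a deeper parse than the default single pass.
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # chunk NP, VP
  """
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]

print(nltk.RegexpParser(grammar).parse(sentence))          # shallow parse
print(nltk.RegexpParser(grammar, loop=2).parse(sentence))  # deeper parse

Note that the loop count is still a fixed limit you choose up front, which is exactly the point being made here.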
However, as the documentation also notes, the basic approach of this parser is still "greedy". It proceeds like this for a fixed or variable number of steps:
Do as much chunking as you can in one step.
Use the output of the last step as the input of the next step, and repeat.
This is naïve because if an early step makes a mistake, it will ruin the whole parse.
Think of a "garden path sentence":
The horse raced past the barn fell.
And a similar string but an entirely different sentence:
The horse raced past the barn.
It will likely be hard to construct a RegexpParser that will parse both of these sentences, because the approach relies on the initial chunking being correct. Correct initial chunking for one will probably be incorrect initial chunking for the other, yet you can't know "which sentence you're in" until you're at a late level in the parsing logic.
For instance, if "the barn fell" is chunked together early on, the parse will fail.
You can add external logic to backtrack when you end up with a "poor" parse, to see if you can find a better one. However, I think you'll find that at that point, more of the important parts of the parsing algorithm are in your external logic, instead of in RegexpParser.
Related
I wrote a lexical analyzer for C++ code in Python, but the problem is that when I use input.split(" ") it won't recognize code like x=2 or function() as three different tokens unless I add a space between them manually, like x = 2.
It also fails to recognize the tokens at the beginning of each line.
(If I add spaces between every two tokens and also at the beginning of each line, my code works correctly.)
I tried splitting the code first by lines and then by spaces, but it got complicated and still I wasn't able to solve the first problem.
I also thought about splitting on the operators, yet I couldn't actually implement it; plus I need the operators to be recognized as tokens as well, so that might not be a good idea.
I would appreciate any solution or suggestion. Thank you.
# First attempt: split the whole file on spaces.
f = open("code.txt")
input = f.read()          # note: this shadows the built-in input()
input = input.split(" ")

# Second attempt: split into lines first, then each line on spaces.
f = open("code.txt")
input = f.read()
input1 = input.split("\n")
for var in input1:
    var = var.split(" ")  # the split result is discarded on each iteration
If you try to split an expression written as x=2 the same way as one written as x = 2... it is pretty obvious that a plain split isn't going to work.
What you are looking for is a solution that works with both, right?
A basic solution is to use the and operator with the conditions that you need to parse. Note that this solution isn't scalable, nor is it good practice, but it can help you work your way towards better but harder solutions.
if input.split(' ') and input.split('='):
An intermediate solution would be to use regex.
Regex isn't an easy topic, but you can check out the online documentation, and there are wonderful online tools for testing your regexes, such as:
Regex 101
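For example, here is a minimal sketch of the regex approach (nowhere near a full C++ lexer; the token patterns are illustrative only). One alternation lists an identifier pattern, a number pattern, and single-character operators/punctuation, so x=2 and function() split into separate tokens without manual spaces:

import re

# identifier | integer | single-character operator or punctuation
token_re = re.compile(r'[A-Za-z_]\w*|\d+|[=+\-*/%<>!&|(){};,]')

code = "x=2\nfunction()"
print(token_re.findall(code))   # ['x', '=', '2', 'function', '(', ')']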
The last option would be to convert your input data into an AST, which stands for abstract syntax tree. This is the technique employed by C++ compilers such as Clang.
This last one is a really hard topic, so figuring it out for a basic lexer will probably be very time-consuming, but it may fit your needs.
The usual approach is to scan the incoming text from left to right. At each character position, the lexical analyser selects the longest string which fits some pattern for a "lexeme", which is either a token or ignored input (whitespace and comments, for example). Then the scan continues at the next character.
Lexical patterns are often described using regular expressions, but the standard regular expression module re is not as much help as it could be for this procedure, because it does not have the facility of checking multiple regular expressions in parallel. (And neither does the possible future replacement, the regex module.) Or, more precisely, the library can check multiple expressions in parallel (using alternation syntax, (...|...|...)), but it lacks an interface which can report which of the alternatives was matched. [Note 1]. So it would be necessary to try every possible pattern one at a time and select whichever one turns out to have the longest match.
Note that the matches are always anchored at the current input point; the lexical analyser does not search for a matching pattern. Every input character becomes part of some lexeme, even if that lexeme is ignored, and lexemes do not overlap.
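As a concrete illustration, here is a hedged sketch of the "try every pattern at the current position, keep the longest match" scan described above. The token names and patterns are illustrative only, not a real C++ lexer:

import re

PATTERNS = [
    ('NUMBER', re.compile(r'\d+')),
    ('NAME',   re.compile(r'[A-Za-z_]\w*')),
    ('OP',     re.compile(r'[=+\-*/(){};]')),
    ('SKIP',   re.compile(r'\s+')),        # whitespace is an ignored lexeme
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in PATTERNS:
            m = pattern.match(text, pos)   # anchored at the current position
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)           # keep the longest match so far
        if best is None:
            raise SyntaxError('unexpected character %r at %d' % (text[pos], pos))
        name, m = best
        if name != 'SKIP':
            yield name, m.group()
        pos = m.end()                      # continue scanning after this lexeme

print(list(tokenize('x=2')))   # [('NAME', 'x'), ('OP', '='), ('NUMBER', '2')]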
You can write such an analyser by hand for a simple language, but C++ is hardly a simple language. Hand-built lexical analysers most certainly exist, but all the ones I've seen are thousands of lines of not very readable code. So it's usually easier to build an analyser automatically using software designed for that purpose. These have been around for a long time -- Lex was written almost 50 years ago, for example -- and if you are planning on writing more than one lexical analyser, you would be well advised to investigate some of the available tools.
Notes
The PCRE2 and Oniguruma regex libraries provide a "callout" feature which I believe could be used for this purpose. I haven't actually seen it used in lexical analysis, but it's a fairly recent addition, particularly for Oniguruma, and as far as I can see, the Python bindings for those two libraries do not wrap the callout feature. (Although, as usual with Python bindings to C libraries, documentation is almost non-existent, so I can't say for certain.)
I am implementing an existing scripting language in part as a toy project and in part so that I can write my own implementation of the program that uses the language. One of the issues I'm running into is that I have a few constructs that overlap in terms of specification but are more clear when used:
Variables - r'[A-Za-z0-9_]+' # Yes, '456' is a valid variable name
Numbers - r'-?[0-9]+(\.[0-9]+)?'
Macros - r'\#[A-Za-z0-9_]+'
Field Reference - r'(this\.)?([A-Za-z]+\.)*[A-Za-z]+'
Tag reference - r'[A-Za-z0-9_]+\.[A-Za-z0-9_]*\??'
This mostly works, but, for example, "456" could be a number or a variable. "34.567" could be a number or a tag reference (the documentation for the scripting language says that it's a bad idea to start identifiers with numbers, but doesn't outright forbid it). Is there a good way to handle the potential ambiguity of the tokens? Currently, I'm tokenizing the former as variable, and the latter as a number, and handling it later in the parser, but it feels very clumsy.
Is there any need for the tokenizer to distinguish between variables, numbers, field references and tag references? Presumably, the parser will be able to decide which of those categories a particular token falls into, by consulting its symbol table of declared variables and possibly by considering the context in which the token was used. If that's the case, then you can just return a single token for all four cases, which will simplify your lexer and probably your grammar.
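For illustration, a hedged sketch of that idea; the merged WORD pattern below is a rough union of the question's patterns, not the language's real specification:

import re

TOKEN_SPEC = [
    ('MACRO', r'\#[A-Za-z0-9_]+'),
    ('WORD',  r'-?[A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)*\??'),  # variable, number, field or tag reference
    ('WS',    r'\s+'),
]
TOKEN_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def lex(text):
    # the parser, not the lexer, decides what each WORD actually is
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != 'WS':
            yield m.lastgroup, m.group()

print(list(lex('#mymacro this.x 34.567 456')))
# [('MACRO', '#mymacro'), ('WORD', 'this.x'), ('WORD', '34.567'), ('WORD', '456')]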
There's a general principle of parser design, which is never sufficiently emphasised, so I'll put it in bold here:
Every parser component should do the absolute minimum amount of work necessary to distinguish between correct inputs.
In other words, if the only possibilities are a unique correct parse and an input error, and it's at all difficult to decide at that point which applies, then just pass the decision on to the next phases, where more information is available. Only do the work necessary to distinguish between two or more different correct inputs.
This applies, for example, to trying to do type-checking in the parser. That's a losing proposition; there isn't enough information to do it correctly until semantic analysis is complete and you know what all of the identifiers refer to. More importantly, it adds no benefit to the parser (or the lexer) because it does not affect how a correct input is parsed; all it does is let you identify certain (not all) incorrect inputs. By the above principle, you shouldn't try.
This principle comes up over and over again in parsing. There is always the temptation to try to make error detection "more precise" too early in the parse. Resist! Do error detection only when you have enough information to do it reliably. You'll have to do it at that point anyway, so you're not saving anything by trying to do some of it earlier. Early detection might shave a few microseconds off of a failed parse, but the speed of parsing incorrect inputs is not very important. Always optimise for correct inputs.
This also applies to writing grammars for syntaxes which are not easy to shoehorn precisely into a one-token-lookahead grammar. It's OK to let an incorrect input sneak through the parse and then detect it during semantic analysis. For example, you could try to detect whether built-in function calls have the correct number of arguments. But why bother? Letting a call with too many or too few arguments go through to semantic analysis does not create any ambiguities. There are lots of other examples.
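A hedged sketch of that last example, with illustrative names (Call, BUILTIN_ARITY) that are not from any particular codebase: the parser builds the call node with however many arguments it finds, and a later semantic pass checks the arity.

BUILTIN_ARITY = {'sqrt': 1, 'pow': 2}

class Call:
    def __init__(self, name, args):
        self.name, self.args = name, args

def check_call(call):
    # runs after parsing, when the whole call node is available
    expected = BUILTIN_ARITY.get(call.name)
    if expected is not None and expected != len(call.args):
        raise TypeError('%s expects %d argument(s), got %d'
                        % (call.name, expected, len(call.args)))

check_call(Call('pow', ['2', '10']))    # fine
# check_call(Call('sqrt', ['4', '9']))  # would raise TypeError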
Other big benefits of letting errors trickle down to the semantic analysis are that it's much easier to generate accurate error messages, which are useful for the end user, and that it's much easier to do error recovery, so you can continue processing the input and provide multiple errors and warnings in a single run, another feature your users will appreciate.
There are exceptions to every guideline, so I'm not saying this is an absolute rule. In COBOL, for example, some operators have different parsing precedences depending on their datatype. (No sensible language designer would commit that barbarity today, I hope, but you do need to take it into account for legacy parsers.) You can only pass a decision down the line if it doesn't create ambiguities between correct inputs. But you should always try to keep this guideline in mind.
I am going through the algorithms course on MIT OCW. It is mentioned in a lecture that we have to be careful when using re.findall, since re can in general be an exponential-complexity algorithm.
Is this a concern when parsing large files or datasets and is there an alternative to regular expressions for efficiently extracting patterns from data?
That depends on what you want to do.
In general, use the simplest tool that is required to do the task.
The in operator would, I imagine, be much more efficient than regular expressions, but it doesn't allow wildcards, repeats, etc. If the pattern you are looking for is all on one line, you can search one line at a time, processing each line (and getting it out of memory) before the next. If you are looking for the start or end of a string, then use mystring.startswith() or mystring.endswith(); these are more efficient.
You might be able to split the data into more manageable chunks.
If you want multi-line searches, which won't be at the beginning or end, and which include wildcards or repeats... you might be stuck with regexes.
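To make the original concern concrete, here is a small illustration (the pattern is a textbook worst case, not something from a real parser): a regex with nested quantifiers can make the backtracking engine in re take time that roughly doubles with every extra character of a non-matching input, while plain substring tests like 'b' in text stay linear.

import re
import time

pattern = re.compile(r'(a+)+$')        # nested quantifiers: classic worst case
for n in (20, 22, 24):
    text = 'a' * n + 'b'
    start = time.perf_counter()
    pattern.search(text)               # never matches, but backtracks heavily
    print(n, time.perf_counter() - start)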
From Perl's documentation:
study takes extra time to study SCALAR ($_ if unspecified) in anticipation of doing
many pattern matches on the string before it is next modified. This may or may not save
time, depending on the nature and number of patterns you are searching and the distribution
of character frequencies in the string to be searched;
I'm trying to speed up some regular expression-driven parsing that I'm doing in Python, and I remembered this trick from Perl. I realize I'll have to benchmark to determine if there is a speedup, but I can't find an equivalent method in Python.
Perl’s study doesn’t really do much anymore. The regex compiler has gotten a whole, whole lot smarter than it was when study was created.
For example, it compiles alternatives into a trie structure with Aho–Corasick prediction.
Run with perl -Mre=debug to see the sorts of cleverness the regex compiler and execution engine apply.
As far as I know there's nothing like this built into Python. But according to the perldoc:
The way study works is this: a linked list of every character in the
string to be searched is made, so we know, for example, where all the
'k' characters are. From each search string, the rarest character is
selected, based on some static frequency tables constructed from some
C programs and English text. Only those places that contain this
"rarest" character are examined.
This doesn't sound very sophisticated, and you could probably hack together something equivalent yourself.
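For instance, here is a rough sketch of the "rarest character" trick described in that perldoc excerpt; it only pays off if you run many literal searches against the same unchanging string, and the names here are illustrative:

from collections import defaultdict

def build_index(text):
    # record every position of every character, once per string (like study)
    index = defaultdict(list)
    for i, ch in enumerate(text):
        index[ch].append(i)
    return index

def find_literal(text, index, needle):
    # only attempt full matches at positions of the needle's rarest character
    rare = min(needle, key=lambda ch: len(index[ch]))
    offset = needle.index(rare)
    for pos in index[rare]:
        start = pos - offset
        if start >= 0 and text.startswith(needle, start):
            yield start

text = "the quick brown fox jumps over the lazy dog " * 1000
index = build_index(text)
print(next(find_literal(text, index, "lazy")))   # 35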
esmre is kind of vaguely similar. And as #Frg noted, you'll want to use re.compile if you're reusing a single regex (to avoid re-parsing the regex itself over and over).
Or you could use suffix trees (here's one implementation, or here's a C extension with unicode support) or suffix arrays (implementation).
I am seeking direction and attempting to label this problem:
I am attempting to build a simple inference engine (is there a better name?) in Python which will take a string and -
1 - create a list of tokens by simply splitting the string on white space
2 - categorise these tokens using regular expressions
3 - use a higher-level set of rules to make decisions based on the categorisations
Example:
"90001" - one token, matches the zipcode regex, a rule exists for a string containing just a zipcode causes a certain behaviour to occur
"30 + 14" - three tokens, regexs for numerical value and mathematical operators match, a rule exists for a numerical value followed by a mathematical operator followed by another numerical value causes a certain behaviour to occur
I'm struggling with how best to do step #3, the higher level set of rules. I'm sure that some framework must exist. Any ideas? Also, how would you characterise this problem? Rule based system, expert system, inference engine, something else?
Thanks!
I'm very surprised that step #3 is the one giving you trouble...
Assuming you can label/categorize each token properly (and that, prior to categorization, you can find the proper tokens, as there may be many ambiguous cases...), the "Step #3" problem seems like one that could easily be tackled with a context-free grammar, where each of the desired actions (such as ZIP code lookup or mathematical expression calculation) would be a symbol whose production rule is made up of the possible token categories. To illustrate this in BNF notation, we could have something like
<SimpleMathOperation> ::= <NumericalValue><Operator><NumericalValue>
Maybe your concern is that when things get complicated, it will become difficult to express the whole requirement in terms of non-conflicting grammar rules. Or maybe your concern is that rules could be added dynamically, hence forcing the grammar "compilation" logic to be integrated with the program? Whatever the concern, I think that this third step will be comparatively trivial.
On the other hand, unless the various categories (and the underlying input text) can also be described with a regular language (as you seem to hint in the question), a text parser and classifier (steps #1 and #2) is typically a less-than-trivial affair.
Some example Python libraries that simplify writing and evaluating grammars:
pijnu
pyparsing
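For instance, a minimal pyparsing sketch of how the BNF production above could be wired up to the desired behaviours; the rule names and parse actions are illustrative only:

from pyparsing import Regex, oneOf

number   = Regex(r'-?\d+(\.\d+)?')
operator = oneOf('+ -')
zipcode  = Regex(r'\d{5}')

# <SimpleMathOperation> ::= <NumericalValue><Operator><NumericalValue>
simple_math = (number + operator + number).setParseAction(
    lambda t: float(t[0]) + float(t[2]) if t[1] == '+' else float(t[0]) - float(t[2]))
zip_lookup = zipcode.setParseAction(lambda t: 'lookup ZIP %s' % t[0])

command = simple_math | zip_lookup

print(command.parseString('30 + 14')[0])   # 44.0
print(command.parseString('90001')[0])     # lookup ZIP 90001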
It looks like you are searching for a "grammar inference" (grammar induction) library.