Pardon me if this question is obvious to some, but I'm trying to teach myself how to write an interpreter. I'm doing this in Python, and I already have a lexer programmed.
I've got my list of tokens created; where I'm stuck is constructing the parse tree. I have sort of an idea of where to go from here, but I'm not sure if I am thinking about it correctly.
This is the syntax I have defined in my grammar for a simple arithmetic expression, using a regex:
<a_expression> = <identifier | number> <operator> <identifier | number>
BUT, if my parser receives a stream of tokens matching this pattern from my lexer:
<identifier | number> <operator> <identifier | number> <operator> <identifier | number>
How do I go about parsing this, since it has two operators and three operands instead of just two operands?
Moreover, how do I handle n operands and n-1 operators? I feel like this should be done recursively, but I'm not sure if I need to define more Parsers for different types of expressions or where to go from here. Can I match a pattern of n operands and n-1 operators with a regex?
While today's 'regular' expressions aren't strictly relegated to the land of Regular Languages, you'll find that you need a more powerful tool to do what you're trying to do.
Context-Free Grammars are what you want, and there are a few tools for writing CFGs in Python. Most notable is pyparsing, but there's a port of Haskell's Parsec library called Pysec that you could look into, too.
Whether a regular expression is apt to parse your syntax depends on whether your syntax (i.e. your grammar) is regular too, or of another Chomsky class.
For type-0 (unrestricted) grammars you will need a Turing machine.
For type-1 (context-sensitive) you will need a linear bounded automaton (or any of the above).
For type-2 (context-free) you will need a pushdown automaton (or any of the above).
And only type-3 (regular) can be read by regular expressions (or any of the above).
You can find further reading, e.g., on Wikipedia.
Infix arithmetic with precedence is not a regular language. Regular expressions are only good for parsing regular languages. (Modern regex implementations aren't really just regular expressions, and they can in fact parse most context-free languages… but they will take exponential time for some of them, and it's non-trivial to predict which ones.)
But it is a context-free language. See the Wikipedia article on Context-free grammar for a brief explanation. Context-free grammars are good for parsing both regular languages and context-free languages.
However, many languages that are non-regular don't need the full power of CFG.
Two important classes are those that are LL- or LR-parseable (in linear time). (Variants on these, especially LALR and SLR, are also important.) For example, Python can be (and is, at least in the reference implementation) parsed by an LL(1) parser.
Your language fits into an even more restrictive subset of LR(1): OP, which is short for "operator precedence". In fact, as the name implies, it's the paradigm case. And OP parsers are much easier to write by hand than more general parsers, so if you're going to build a custom parser from scratch, that's probably what you'd want to use here.
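For a concrete sense of what such a parser can look like, here is a minimal precedence-climbing sketch that works over an already-tokenized stream; the token format and the precedence table are my own illustrative assumptions, not taken from the question:
# Minimal precedence-climbing sketch over a flat token list.
# The token format and the precedence table are illustrative assumptions.
PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}

def parse_expression(tokens, pos=0, min_prec=1):
    """Parse tokens[pos:] and return (tree, next_pos)."""
    left, pos = tokens[pos], pos + 1            # operand: identifier or number
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        # operators of higher precedence bind the operands to our right first
        right, pos = parse_expression(tokens, pos + 1, PRECEDENCE[op] + 1)
        left = (op, left, right)                # build a tree node
    return left, pos

tree, _ = parse_expression(['a', '+', '3', '*', 'b', '-', '2'])
print(tree)   # ('-', ('+', 'a', ('*', '3', 'b')), '2')
The recursion handles any number of operands with one fewer operator, which is exactly the n operands / n-1 operators shape the question describes.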
Related
I wrote a lexical analyzer for C++ code in Python, but the problem is that when I use input.split(" ") it won't recognize code like x=2 or function() as three different tokens unless I add a space between them manually, like x = 2.
It also fails to recognize the tokens at the beginning of each line.
(If I add spaces between every two tokens and also at the beginning of each line, my code works correctly.)
I tried splitting the code first by lines and then by spaces, but it got complicated and I still wasn't able to solve the first problem.
I also thought about splitting it by operators, but I couldn't actually implement it; plus, I need the operators to be recognized as tokens as well, so this might not be a good idea.
I would appreciate it if anyone could give a solution or suggestion. Thank you.
f=open("code.txt")
input=f.read()
input=input.split(" ")
f=open("code.txt")
input=f.read()
input1=input.split("\n")
for var in input1:
    var=var.split(" ")
If you try to split an expression like x=2 the same way as x = 2, that obviously isn't going to work.
What you are looking for is a solution that works with both, right?
A basic solution is to use an and operator combining the conditions that you need to parse. Note that this solution isn't scalable, nor is it good practice, but it can help you work toward better (but harder) solutions.
if input.split(' ') and input.split('='):
An intermediate solution would be to use regex.
Regex isn't an easy topic, but you can check out the online documentation, and then you have wonderful online tools to check your regexes.
Regex 101
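As a rough sketch of the regex route, a single alternation pattern fed to re.findall can already pull x=2 or function() apart; the token classes below are illustrative, not a complete C++ token set:
import re

# Rough sketch of the regex route: words/numbers, two-character operators,
# then single-character operators and punctuation. Not a complete C++ token set.
TOKEN_RE = re.compile(r'\w+|==|!=|<=|>=|[-+*/=(){};<>,]')

print(TOKEN_RE.findall("x=2"))          # ['x', '=', '2']
print(TOKEN_RE.findall("function()"))   # ['function', '(', ')']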
The last one would be to convert your input data into an AST, which stands for abstract syntax tree. This is the technique employed by C++ compilers such as, for example, Clang.
This last one is a really hard topic, so for figuring out a basic lexer it will probably be very time consuming, but maybe it could fit your needs.
The usual approach is to scan the incoming text from left to right. At each character position, the lexical analyser selects the longest string which fits some pattern for a "lexeme", which is either a token or ignored input (whitespace and comments, for example). Then the scan continues at the next character.
Lexical patterns are often described using regular expressions, but the standard regular expression module re is not as much help as it could be for this procedure, because it does not have the facility of checking multiple regular expressions in parallel. (And neither does the possible future replacement, the regex module.) Or, more precisely, the library can check multiple expressions in parallel (using alternation syntax, (...|...|...)), but it lacks an interface which can report which of the alternatives was matched. [Note 1]. So it would be necessary to try every possible pattern one at a time and select whichever one turns out to have the longest match.
Note that the matches are always anchored at the current input point; the lexical analyser does not search for a matching pattern. Every input character becomes part of some lexeme, even if that lexeme is ignored, and lexemes do not overlap.
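To illustrate the approach just described, here is a sketch that, at each position, tries every pattern one at a time and keeps the longest match; the token names and patterns are assumptions for the example, not a real C++ token set:
import re

# Maximal munch: at each input position, try every pattern and keep the
# longest match. Token names and patterns are illustrative assumptions.
PATTERNS = [
    ('NUMBER', re.compile(r'\d+')),
    ('NAME',   re.compile(r'[A-Za-z_]\w*')),
    ('OP',     re.compile(r'==|!=|<=|>=|[-+*/=<>(){};]')),
    ('SKIP',   re.compile(r'\s+')),          # whitespace: an ignored lexeme
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        best_kind, best_match = None, None
        for kind, pattern in PATTERNS:
            m = pattern.match(text, pos)     # anchored at pos, no searching
            if m and (best_match is None or m.end() > best_match.end()):
                best_kind, best_match = kind, m
        if best_match is None:
            raise SyntaxError(f'unexpected character {text[pos]!r}')
        if best_kind != 'SKIP':
            yield best_kind, best_match.group()
        pos = best_match.end()

print(list(tokenize('x = y + 42')))
# [('NAME', 'x'), ('OP', '='), ('NAME', 'y'), ('OP', '+'), ('NUMBER', '42')]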
You can write such an analyser by hand for a simple language, but C++ is hardly a simple language. Hand-built lexical analysers most certainly exist, but all the ones I've seen are thousands of lines of not very readable code. So it's usually easier to build an analyser automatically using software designed for that purpose. These have been around for a long time -- Lex was written almost 50 years ago, for example -- and if you are planning on writing more than one lexical analyser, you would be well advised to investigate some of the available tools.
Notes
The PCRE2 and Oniguruma regex libraries provide a "callout" feature which I believe could be used for this purpose. I haven't actually seen it used in lexical analysis, but it's a fairly recent addition, particularly for Oniguruma, and as far as I can see, the Python bindings for those two libraries do not wrap the callout feature. (Although, as usual with Python bindings to C libraries, documentation is almost non-existent, so I can't say for certain.)
It is possible to write a regex which in some cases needs exponential running time. Such an example is (aa|aa)*. If the input is an odd number of a's, it needs exponential running time.
It is easy to test this. If the input contains only a's and has length 51, the regex needs some seconds to compute (on my machine). Instead, if the input length is 52, its computing time is not noticeable (I tested this with the built-in regex engine of the Java RE).
I have written a regex parser to find the reason for this behavior, but I didn't find it. My parser can build an AST or an NFA based on a regex. After that it can translate the NFA to a DFA. To do this it uses the powerset construction algorithm.
When I parse the regex mentioned above, the parser creates an NFA with 7 states - after conversion there are only 3 states left in the DFA. The DFA represents the more sensible regex (aa)*, which can be parsed very fast.
Thus, I don't understand why there are parsers which can be so slow. What is the reason for this? Do they not translate the NFA to a DFA? If yes, why not? And what are the technical reasons why they compute so slowly?
Russ Cox has a very detailed article about why this is and the history of regexes (part 2, part 3).
Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.
Largely, it comes down to proliferation of non-regular features in "regular" expressions such as backreferences, and the (continued) ignorance of most programmers that there are better alternatives for regexes that do not contain such features (which is many of them).
While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.
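The blow-up is easy to reproduce with Python's backtracking re module as well; a quick timing sketch (timings are machine-dependent, and each extra pair of a's roughly doubles the slow case):
import re
import time

# Reproducing the effect with CPython's backtracking `re` engine.
# The odd-length input forces the full match to fail and backtrack exponentially;
# timings are machine-dependent, so adjust the lengths if needed.
pattern = re.compile(r'(aa|aa)*')

for n in (41, 42):
    start = time.perf_counter()
    pattern.fullmatch('a' * n)
    print(f'length {n}: {time.perf_counter() - start:.3f}s')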
Regular expressions conforming to this formal definition are computable in linear time, because they have corresponding finite automata. They are built only from parentheses, alternation | (sometimes called sum), the Kleene star * and concatenation.
Extending regular expressions by adding, for example, backreferences can lead even to NP-complete regular expressions.
Here you can find an example of a regular expression recognizing non-prime numbers.
I guess that such an extended implementation can have non-linear matching time even in simple cases.
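The classic example of that kind is the unary "non-prime" regex, which depends on a backreference; a quick way to play with it in Python:
import re

# The classic backreference example: a string of n ones matches iff n is
# 0, 1, or composite, so the pattern rejects exactly the prime lengths.
NON_PRIME = re.compile(r'1?$|(11+?)\1+$')

for n in range(2, 20):
    if not NON_PRIME.match('1' * n):
        print(n, 'is prime')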
I made a quick experiment in Perl, and your regular expression computes equally fast for odd and even numbers of 'a's.
I am researching how to implement a DSL in Python, and I am looking for a small DSL implementation that is friendly to someone who has no experience designing and implementing languages. So far, I have reviewed two implementations, Hy and Mochi. Hy is actually a dialect of Lisp, and Mochi seems to be very similar to Elixir. Both are too complex for me right now, as my aim is to prototype the language and play around with it in order to find out whether it really helps in solving the problem and fits the style that the problem requires. I am aware that Python has good support via the language tools provided in the standard library. So far I have implemented a very simple dialect of Lisp; I did not use the Python AST at all, and it is implemented purely via string processing, which is not nearly flexible enough for what I am looking for.
Are there any implementations other than the two languages mentioned above that are small enough to be studied?
What are some good books on this subject (practical, in the sense that they do not stick only to the theoretical and academic aspects)?
What would be a good way to study the Python AST and start using it?
Are there any significant performance issues with languages built on top of Python (like Hy), in terms of overhead in the produced bytecode?
Thanks
You can split the task of creating (yet another!) new language into at least two big steps:
Syntax
Semantics & Interpretation
Syntax
You need to define a grammar for your language, with production rules that specify how to create complex expressions from simple ones.
Example: syntax for LISP:
expression ::= atom | list
atom ::= number | symbol
number ::= [+-]?['0'-'9']+
symbol ::= ['A'-'Z''a'-'z'].*
list ::= '(' expression* ')'
How to read it: an expression is either an atom or a list; an atom is a number or a symbol; a number is... and so on.
Often you will also define some tokenization rules, because most grammars work at the token level, not at the character level.
Once you have defined your grammar, you want a parser that, given a sentence (a program), is able to build the derivation tree, or abstract syntax tree.
For example, for the expression x=f(y+1)+2, you want to obtain a tree whose root is the assignment to x, whose right-hand side is the addition of the function call f(y+1) and the constant 2.
There are several kinds of parsers (LL, LR, recursive descent, ...). You don't necessarily need to write your language's parser by yourself, as there are tools that generate the parser from the grammar specification (Lex & Yacc, Flex & Bison, JavaCC, ANTLR; also check this list of parsers available for Python).
If you want to skip the step of designing a new grammar, you may want to start from a simple one, like the grammar of LISP. There is even a LISP parser written in Python in the Pyperplan project. They use it for parsing PDDL, which is a domain specific language for planning that is based on LISP.
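To make the grammar above concrete, a small recursive-descent parser for it can fit in a couple of dozen lines; this is only a sketch, with error handling kept to a minimum:
import re

# A tiny recursive-descent parser for the LISP grammar sketched above.
# Tokenizer: parentheses are their own tokens; everything else splits on whitespace.
def tokenize(text):
    return re.findall(r'[()]|[^\s()]+', text)

def parse(tokens):
    """expression ::= atom | list"""
    token = tokens.pop(0)
    if token == '(':                     # list ::= '(' expression* ')'
        lst = []
        while tokens[0] != ')':
            lst.append(parse(tokens))
        tokens.pop(0)                    # drop the closing ')'
        return lst
    try:
        return int(token)                # number
    except ValueError:
        return token                     # symbol

print(parse(tokenize('(define (square x) (* x x))')))
# ['define', ['square', 'x'], ['*', 'x', 'x']]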
Useful readings:
Book: Compilers: Principles, Techniques, and Tools by Alfred Aho, Jeffrey Ullman, Monica S. Lam, and Ravi Sethi, also known as The Dragon Book (because of the dragon pictured on the cover)
https://en.wikipedia.org/?title=Syntax_(programming_languages)
https://en.wikibooks.org/wiki/Introduction_to_Programming_Languages/Grammars
How to define a grammar for a programming language
https://en.wikipedia.org/wiki/LL_parser
https://en.wikipedia.org/wiki/Recursive_descent_parser
https://en.wikipedia.org/wiki/LALR_parser
https://en.wikipedia.org/wiki/LR_parser
Semantics & Interpretation
Once you have the abstract syntax tree of your program, you want to execute your program. There are several formalisms for specifying the "rules" to execute (pieces of) programs:
Operational semantics: a very popular one. It is classified into two categories:
Small Step Semantics: describe individual steps of computation
Big Step Semantics: describe the overall results of computation
Reduction semantics: a formalism based on lambda calculus
Transition semantics: if you look at your interpreter like a transition system, you can specify its semantics using transition semantics. This is especially useful for programs that do not terminate (i.e. run continuously), like controllers.
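As a toy illustration of the big-step style mentioned above, here is an evaluator over the nested-list trees produced by the parser sketch in the Syntax section; only numbers, a few arithmetic primitives and if are handled, and the names are illustrative:
import operator

# Toy big-step evaluator over nested lists such as ['+', 3, ['*', 2, 5]].
# Only numbers, a few arithmetic primitives and 'if' are handled.
GLOBALS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def evaluate(expr, env=GLOBALS):
    if isinstance(expr, int):                      # a number evaluates to itself
        return expr
    if isinstance(expr, str):                      # a symbol is looked up
        return env[expr]
    if expr[0] == 'if':                            # (if test then else)
        _, test, then, alt = expr
        return evaluate(then if evaluate(test, env) else alt, env)
    func = evaluate(expr[0], env)                  # (f arg1 arg2 ...)
    args = [evaluate(arg, env) for arg in expr[1:]]
    return func(*args)

print(evaluate(['+', 3, ['*', 2, 5]]))   # 13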
Useful readings:
Book: A Structural Approach to Operational Semantics [pdf link] by Gordon D. Plotkin
Book: Structure and Interpretation of Computer Programs by Gerald Jay Sussman and Hal Abelson
Book: Semantics Engineering with PLT Redex (SEwPR) by Matthias Felleisen, Robert Bruce Findler, and Matthew Flatt
https://en.wikipedia.org/wiki/Semantics_(computer_science)
https://en.wikipedia.org/wiki/Operational_semantics
https://en.wikipedia.org/wiki/Transition_system
https://en.wikipedia.org/wiki/Kripke_structure_(model_checking)
https://en.wikipedia.org/wiki/Hoare_logic
https://en.wikipedia.org/wiki/Lambda_calculus
You don't really need to know a lot about parsing to write your own language.
I wrote a library that lets you do just that very easily: https://github.com/erezsh/lark
Here's a blog post by me explaining how to use it to write your own language: http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/
I hope you don't mind my shameless plug, but it seems very relevant to your question.
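For a taste of what that looks like, a tiny calculator grammar in Lark might look roughly like this (a sketch; check the Lark documentation for the authoritative grammar syntax):
from lark import Lark

# Rough sketch of a calculator grammar in Lark's EBNF-like notation.
grammar = r"""
    ?expr: expr "+" term   -> add
         | expr "-" term   -> sub
         | term
    ?term: term "*" atom   -> mul
         | atom
    ?atom: NUMBER          -> number
         | "(" expr ")"

    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start='expr')
print(parser.parse("1 + 2 * (3 + 4)").pretty())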
I was wondering if anyone knew of a good Python library for evaluating text-based mathematical expressions. So for example,
>>> evaluate("Three plus nine")
12
>>> evaluate("Eight + two")
10
I've seen similar examples that people have done for numeric values and operators in a string. One method used eval to compute the literal value of the expression. And another method of doing this used regex to parse the text.
If there isn't an existing library that handles this well I will probably end up using a combination of the regex and eval techniques for this. I just want to confirm that there isn't already something like this already out there.
You could try pyparsing, which does general recursive descent parsing. In fact, here is something quite close to your second example.
About your other suggestions.
See here about the security issues of eval (ironically, using it for a calculator).
Fundamentally, regular languages are weaker than pushdown automata languages. You shouldn't try to fight a general parse tree problem with regexes.
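If you do end up rolling something small by hand instead, a minimal sketch without eval might look like this (the tiny vocabulary and the left-to-right folding are assumptions for the example, not an existing library):
# Minimal hand-rolled sketch: map number words and operator words to values,
# then fold the expression left to right. The vocabulary here is deliberately tiny.
NUMBERS = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
           'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10}
OPERATORS = {'plus': lambda a, b: a + b, '+': lambda a, b: a + b,
             'minus': lambda a, b: a - b, '-': lambda a, b: a - b}

def evaluate(text):
    words = text.lower().split()
    result = NUMBERS[words[0]]
    for op_word, num_word in zip(words[1::2], words[2::2]):
        result = OPERATORS[op_word](result, NUMBERS[num_word])
    return result

print(evaluate("Three plus nine"))   # 12
print(evaluate("Eight + two"))       # 10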
From Perl's documentation:
study takes extra time to study SCALAR ($_ if unspecified) in anticipation of doing many pattern matches on the string before it is next modified. This may or may not save time, depending on the nature and number of patterns you are searching and the distribution of character frequencies in the string to be searched;
I'm trying to speed up some regular expression-driven parsing that I'm doing in Python, and I remembered this trick from Perl. I realize I'll have to benchmark to determine if there is a speedup, but I can't find an equivalent method in Python.
Perl’s study doesn’t really do much anymore. The regex compiler has gotten a whole, whole lot smarter than it was when study was created.
For example, it compiles alternatives into a trie structure with Aho–Corasick prediction.
Run with perl -Mre=debug to see the sorts of cleverness the regex compiler and execution engine apply.
As far as I know there's nothing like this built into Python. But according to the perldoc:
The way study works is this: a linked list of every character in the string to be searched is made, so we know, for example, where all the 'k' characters are. From each search string, the rarest character is selected, based on some static frequency tables constructed from some C programs and English text. Only those places that contain this "rarest" character are examined.
This doesn't sound very sophisticated, and you could probably hack together something equivalent yourself.
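For instance, a rough sketch of a similar idea: index the text once, then only probe the positions where each needle's rarest character occurs (all names here are illustrative):
from collections import defaultdict

# Rough sketch of a study-like prefilter: index the haystack once, then only
# test the positions where the rarest character of each needle occurs.
def build_index(text):
    index = defaultdict(list)
    for position, char in enumerate(text):
        index[char].append(position)
    return index

def find_all(needle, text, index):
    # choose the needle character that occurs least often in the text
    offset, rare = min(enumerate(needle), key=lambda oc: len(index[oc[1]]))
    for position in index[rare]:
        start = position - offset
        if start >= 0 and text.startswith(needle, start):
            yield start

text = "the quick brown fox jumps over the lazy dog"
index = build_index(text)
print(list(find_all("lazy", text, index)))   # [35]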
esmre is kind of vaguely similar. And as @Frg noted, you'll want to use re.compile if you're reusing a single regex (to avoid re-parsing the regex itself over and over).
Or you could use suffix trees (here's one implementation, or here's a C extension with unicode support) or suffix arrays (implementation).