Tiny DSL implementation in Python

I am researching how to implement a DSL in Python, and I am looking for a small DSL that is friendly to someone with no experience in designing and implementing languages. So far I have reviewed two implementations, Hy and Mochi. Hy is actually a dialect of Lisp, and Mochi seems very similar to Elixir. Both are too complex for me right now, as my aim is to prototype the language and play around with it in order to find out whether it really helps in solving the problem and fits the style the problem requires. I am aware that Python has good support via the language tools provided in the standard library. So far I have implemented a dialect of Lisp which is indeed very simple; I did not use the Python AST at all, and it is implemented purely via string processing, which is not at all as flexible as what I am looking for.
Are there any implementations other than the two languages mentioned above that are small enough to be studied?
What are some good books on this subject (practical, in the sense that they do not stick only to theoretical and academic aspects)?
What would be a good way to start studying the Python AST and using it?
Are there any significant performance issues with languages built on top of Python (like Hy), in terms of the overhead on the produced bytecode?
Thanks

You can split the task of creating a (yet another!) new language into at least two big steps:
Syntax
Semantics & Interpretation
Syntax
You need to define a grammar for your language, with production rules that specify how to create complex expressions from simple ones.
Example: syntax for LISP:
expression ::= atom | list
atom ::= number | symbol
number ::= [+-]?['0'-'9']+
symbol ::= ['A'-'Z''a'-'z'].*
list ::= '(' expression* ')'
How to read it: an expression is either an atom or a list; an atom is a number or a symbol; a number is... and so on.
Often you will also define some tokenization rules, because most grammars work at the token level, not at the character level.
Once you have defined your grammar, you want a parser that, given a sentence (a program), is able to build the derivation tree, or abstract syntax tree.
For example, for the expression x=f(y+1)+2, you want to obtain a tree along these lines (assignment at the root; the call f(y+1) and the constant 2 are the operands of +):
=
├── x
└── +
    ├── call
    │   ├── f
    │   └── +
    │       ├── y
    │       └── 1
    └── 2
There are several kinds of parsers (LL, LR, recursive descent, ...). You don't necessarily need to write your language's parser yourself, as there are tools that generate the parser from the grammar specification (LEX & YACC, Flex & Bison, JavaCC, ANTLR; also check this list of parsers available for Python).
If you want to skip the step of designing a new grammar, you may want to start from a simple one, like the grammar of LISP. There is even a LISP parser written in Python in the Pyperplan project. They use it to parse PDDL, a domain-specific language for planning that is based on LISP.
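To make the grammar above concrete, here is a minimal sketch of a tokenizer and recursive-descent parser for it in Python. This is my own toy code with no error handling; tokenize and parse are just illustrative names, not part of any of the tools mentioned.
import re

def tokenize(source):
    # '(' and ')' become their own tokens; everything else splits on whitespace
    return re.findall(r"[()]|[^\s()]+", source)

def parse(tokens):
    # expression ::= atom | list
    token = tokens.pop(0)
    if token == "(":                         # list ::= '(' expression* ')'
        children = []
        while tokens[0] != ")":
            children.append(parse(tokens))
        tokens.pop(0)                        # consume the closing ')'
        return children
    if re.fullmatch(r"[+-]?[0-9]+", token):  # number
        return int(token)
    return token                             # symbol

print(parse(tokenize("(define (square x) (* x x))")))
# ['define', ['square', 'x'], ['*', 'x', 'x']]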
Useful readings:
Book: Compilers: Principles, Techniques, and Tools by Alfred Aho, Jeffrey Ullman, Monica S. Lam, and Ravi Sethi, also known as The Dragon Book (because of the dragon pictured on the cover)
https://en.wikipedia.org/?title=Syntax_(programming_languages)
https://en.wikibooks.org/wiki/Introduction_to_Programming_Languages/Grammars
How to define a grammar for a programming language
https://en.wikipedia.org/wiki/LL_parser
https://en.wikipedia.org/wiki/Recursive_descent_parser
https://en.wikipedia.org/wiki/LALR_parser
https://en.wikipedia.org/wiki/LR_parser
Semantics & Interpretation
Once you have the abstract syntax tree of your program, you want to execute it. There are several formalisms for specifying the "rules" for executing (pieces of) programs:
Operational semantics: a very popular one. It is classified into two categories:
Small Step Semantics: describe individual steps of computation
Big Step Semantics: describe the overall results of computation
Reduction semantics: a formalism based on lambda calculus
Transition semantics: if you look at your interpreter like a transition system, you can specify its semantics using transition semantics. This is especially useful for programs that do not terminate (i.e. run continuously), like controllers.
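To give a taste of the big-step style in code, here is a hedged sketch of an evaluator for the kind of nested-list LISP trees the parser sketch above produces (numbers as ints, symbols as strings). The environment handling is deliberately minimal, and the BUILTINS table is just an illustration.
import operator

BUILTINS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node, env):
    # big-step: each case maps a whole subtree directly to its final value
    if isinstance(node, int):    # a number evaluates to itself
        return node
    if isinstance(node, str):    # a symbol evaluates to its binding
        return env[node]
    op, *args = node             # a list applies its operator to evaluated args
    return BUILTINS[op](*(evaluate(arg, env) for arg in args))

print(evaluate(["+", ["*", "x", "x"], 1], {"x": 3}))   # prints 10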
Useful readings:
Book: A Structural Approach to Operational Semantics [pdf link] by Gordon D. Plotkin
Book: Structure and Interpretation of Computer Programs by Gerald Jay Sussman and Hal Abelson
Book: Semantics Engineering with PLT Redex (SEwPR) by Matthias Felleisen, Robert Bruce Findler, and Matthew Flatt
https://en.wikipedia.org/wiki/Semantics_(computer_science)
https://en.wikipedia.org/wiki/Operational_semantics
https://en.wikipedia.org/wiki/Transition_system
https://en.wikipedia.org/wiki/Kripke_structure_(model_checking)
https://en.wikipedia.org/wiki/Hoare_logic
https://en.wikipedia.org/wiki/Lambda_calculus

You don't really need to know a lot about parsing to write your own language.
I wrote a library that lets you do just that very easily: https://github.com/erezsh/lark
Here's a blog post by me explaining how to use it to write your own language: http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/
I hope you don't mind my shameless plug, but it seems very relevant to your question.
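For a flavor of the library, here is a minimal sketch using Lark's documented API; the toy calculator grammar is mine, not taken from the blog post.
from lark import Lark

grammar = """
    start: expr
    expr: expr "+" term   -> add
        | term
    term: NUMBER
    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar)              # Earley parser by default
print(parser.parse("1 + 2 + 3").pretty())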

Related

Exponential running time for simple Python regex [duplicate]

It is possible to write a Regex which in some cases needs exponential running time. One such example is (aa|aa)*. If the input is an odd number of a's, matching needs exponential running time.
It is easy to test this. If the input contains only a's and has length 51, the Regex needs some seconds to compute (on my machine). If instead the input length is 52, the computing time is not noticeable (I tested this with the built-in Regex parser of the Java RE).
I have written a Regex parser to find the reason for this behavior, but I didn't find it. My parser can build an AST or an NFA from a Regex. After that it can translate the NFA to a DFA, using the powerset construction algorithm.
When I parse the Regex mentioned above, the parser creates an NFA with 7 states; after conversion, only 3 states are left in the DFA. The DFA represents the more sensible Regex (aa)*, which can be matched very fast.
Thus, I don't understand why there are parsers which can be so slow. What is the reason for this? Do they not translate the NFA to a DFA? If so, why not? And what are the technical reasons why they compute so slowly?
Russ Cox has a very detailed article about why this is and the history of regexes (part 2, part 3).
Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.
Largely, it comes down to the proliferation of non-regular features in "regular" expressions, such as backreferences, and the (continued) ignorance of most programmers that there are better alternatives for regexes that do not contain such features (which is many of them).
While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.
Regular expressions conforming to this formal definition are computable in linear time, because they have corresponding finite automata. They are built only from parentheses, alternation | (sometimes called sum), the Kleene star *, and concatenation.
Extending regular expressions with, for example, backreferences makes the matching problem NP-complete.
Here you can find an example of a regular expression recognizing non-prime numbers.
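In case the link goes stale: the well-known example is a backreference regex that matches composite (non-prime) numbers written in unary. A quick Python sketch:
import re

# matches 0, 1, and every composite number written as a string of 1's
not_prime = re.compile(r"^1?$|^(11+?)\1+$")

print([n for n in range(2, 20) if not not_prime.match("1" * n)])
# [2, 3, 5, 7, 11, 13, 17, 19]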
I guess that such an extended implementation can have non-linear matching time even in simple cases.
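You can reproduce the blow-up in Python, whose re module is one of the backtracking implementations mentioned above. A small timing sketch; the exact numbers will vary by machine:
import re
import time

pattern = re.compile(r"(aa|aa)*$")   # the $ forces full backtracking on failure

for n in (50, 45, 47, 49, 51):       # 50 matches instantly; each odd step roughly doubles the time
    start = time.perf_counter()
    pattern.match("a" * n)
    print(n, round(time.perf_counter() - start, 3), "s")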
I made a quick experiment in Perl, and your regular expression computes equally fast for odd and even numbers of 'a's.

Is there a mathematical definition of Python's syntax and semantics

I've been looking for a mathematical definition of Python, like there is for, e.g., relational algebra from IBM. In some scripts from my university courses I find something similar, but it is usually just an example language used to explain some concepts. I would love to find something similar for Python.
Relational algebra? Computer languages like Python are not relational.
Perhaps you can find a grammar for Python.
Then you can feed it to a lexer/parser to get an abstract syntax tree (AST). Once you have that, you can walk the tree and generate whatever you wish for each node (e.g. bytecode).
This is the bread and butter of compiler design.
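For Python specifically, the standard library's ast module already exposes parse trees for the official grammar, so you can skip writing a lexer/parser entirely:
import ast

tree = ast.parse("x = f(y + 1) + 2")
print(ast.dump(tree, indent=2))   # the indent argument needs Python 3.9+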

How to use python when learning algorithms?

Basically, I am trying to learn some very basic data structures and algorithms using Python. However, I think that when implementing these algorithms I unknowingly start using Python tricks a bit, even ones as simple as the following, which would not be considered tricks by any stretch of the imagination:
for i, item in enumerate(arr):
    # Algo implementation
or
if item in items:
    # do something
I don't know what general guideline to follow so that I can grasp the algorithm as it's meant to be implemented.
It is all right to use Python's techniques to solve problems. The main exception is when Python does something for you and you want to learn how that something is done. One example is Python's heapq: you can't use it directly if your purpose is to understand how the binary heap structure can be used to implement a priority queue. Of course, you could read the source code and learn much that way.
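For instance, here is a rough from-scratch sketch of the core of a binary min-heap; it illustrates the idea and is not heapq's actual implementation.
class MinHeap:
    def __init__(self):
        self.items = []

    def push(self, value):
        # append at the end, then sift up while smaller than the parent
        self.items.append(value)
        i = len(self.items) - 1
        while i > 0 and self.items[i] < self.items[(i - 1) // 2]:
            parent = (i - 1) // 2
            self.items[i], self.items[parent] = self.items[parent], self.items[i]
            i = parent

    def pop(self):
        # move the last leaf to the root, then sift it down
        items = self.items
        items[0], items[-1] = items[-1], items[0]
        smallest = items.pop()
        i = 0
        while True:
            child = 2 * i + 1
            if child + 1 < len(items) and items[child + 1] < items[child]:
                child += 1                     # pick the smaller child
            if child >= len(items) or items[i] <= items[child]:
                break
            items[i], items[child] = items[child], items[i]
            i = child
        return smallest

heap = MinHeap()
for x in (5, 1, 4, 2, 3):
    heap.push(x)
print([heap.pop() for _ in range(5)])   # [1, 2, 3, 4, 5]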
One thing that can help you is to read a data structures and algorithms book that is based on Python. Then you can be assured that Python will not be used to slide over important topics--at least, if the book is any good.
One such book is Problem Solving with Algorithms and Data Structures. Another is Classic Computer Science Problems in Python. The first book is a free PDF download, though I believe there is a more recent edition that is not free. The second is not free but you can get a discount of 40% at the publisher's web site if you use a discount code mentioned in the Talk Python to Me podcast. I am working through the latter book now, as a reminder of the class I took a very long time ago.
As to a recommendation, the price may be a deciding factor for you. The emphases of the two books also differ. The first is older, using more generic Python and not using many of Python's special features. It is also closer to a textbook, going more into depth in its topics. It covers things like execution complexity, for example. The PDF version, however, does not cover as many topics as other versions I have seen. The PDF does not cover graphs (networks), for example, which another version (which I cannot find now) does.
The second is much more recent, using features of Python 3.7 such as type hints. It also is more of an introduction or review. I think I can use "fair use" to quote the relevant section of the book:
Who is this book for?
This book is for both intermediate and experienced programmers. Experienced programmers who want to deepen their knowledge of Python will find comfortably familiar problems from their computer science or programming education. Intermediate programmers will be introduced to these classic problems in the language of their choice: Python. Developers getting ready for coding interviews will likely find this book to be valuable preparation material.
In addition to professional programmers, students enrolled in undergraduate computer science programs who have an interest in Python will likely find this book helpful. It makes no attempt to be a rigorous introduction to data structures and algorithms. This is not a data structures and algorithms textbook. You will not find proofs or extensive use of big-O notation within its pages. Instead, it is positioned as an approachable, hands-on tutorial to the problem-solving techniques that should be the end product of taking data structure, algorithm, and artificial intelligence classes.
Once again, knowledge of Python’s syntax and semantics is assumed. A reader with zero programming experience will get little out of this book, and a programmer with zero Python experience will almost certainly struggle. In other words, Classic Computer Science Problems in Python is a book for working Python programmers and computer science students.
If you want to understand how an algorithm works, I would strongly recommend working with flowcharts. They represent the algorithmic procedure as relations between the elementary logical statements the algorithm is made of, and they are independent of the programming language in which an algorithm might be implemented.
If you want to learn Python along with it, here is what you can do:
1. Study the flowchart of the algorithm that interests you.
2. Translate that flowchart 1-to-1 into Python code.
3. Take a closer look at your Python code and try to optimize or compact it.
This can be best illustrated with an example:
1.
Here is the flowchart of Euclid's algorithm for finding the greatest common divisor of two numbers, taken from the wiki page Algorithm (its boxes are numbered, and the comments in the code below refer to those numbers):
To understand an algorithm means to be able to follow or even reproduce this flowchart.
2.
Now if your goal is to learn Python, a great exercise is to take a flowchart and translate it into Python. No shortcuts, no simplifications, just a 1-to-1 translation of the algorithm as it is written. You won't be fooled by any tricks or masked complexity when doing so, as the flowchart gives you the elementary logical steps and you are just translating them into your preferred programming language.
For the example above, a crude 1-to-1 implementation looks like this:
def gcd(a, b):        # point 1
    while True:
        if b == 0:    # p. 2
            return a  # p. 8 + 9
        if a > b:     # p. 3
            a = a - b # p. 6
                      # p. 7
        else:         # p. 3
            b = b - a # p. 4
                      # p. 5
3.
By now you have learned both how the algorithm works and how to implement logical statements in Python. The tricks you mentioned earlier can enter the game here. You can start to play around and try to make the implementation more efficient, more compact, or a one-liner (people like this for some reason). This will not only help your logical understanding but will also deepen your knowledge of the programming language you are using.
As for the example at hand, Euclid's algorithm, there is not a lot of fancy business that comes to mind. I somehow find recursive calls elegant, so here is a tricky implementation using them:
def gcd(a, b):
    if b == 0:
        return a
    else:
        return gcd(a - b, b) if a > b else gcd(a, b - a)
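By the way, a further compaction along these lines (my addition, not part of the flowchart exercise) uses the standard observation that repeated subtraction amounts to taking a remainder:
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a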
Note that you can (and sometimes even have to) do this procedure in the reverse order. It can happen that the only thing you know about an algorithm is an implementation of it. Then you would proceed in exactly the reversed order: 3.->2. Try to identify and 'expand' all the trickery that might be present in the implementation. 2.->1. Use the 'expanded' implementation to create a flowchart of the algorithm, in order to have a proper definition.
They are not tricks!
These are the same things you would do in any other language; Python just makes them simpler.
In C/C++ you would write:
for (int i = 0; i < sizeof(arr)/sizeof(arr[0]); i++) {
    // access the array elements here as arr[i]
}
You would do the same thing in Python in a slightly more convenient way, i.e.
for i, a in enumerate(arr):
    # do something
or
for i in range(len(arr)):
    # do something with arr elements
Your algorithms will NOT depend on these syntactical differences.
Whether it is Python, C/C++, or any other language: if you have a good understanding of the language, you are good to go with anything. You just have to keep in mind the time complexities of the algorithms you use and how you implement them.
The thing with Python is that it is much easier to understand, it is shorter to write, it has a lot of built-in functions, you need no class or main function to execute your program, and so on.
If you ask me, I would not say these are tricks at all. All programming languages have these things in common, with just syntactical differences.
It depends on what you are trying to implement. Say you are trying to implement a linked list: then you just need to know what you can use in Python to implement it.

library for transforming a node tree

I'd like to be able to express a general transformation of one tree into another without writing a bunch of repetitive spaghetti code. Are there any libraries to help with this problem? My target language is Python, but I'll look at other languages as long as it's feasible to port to Python.
Example: I'd like to transform this node tree: (please excuse the S-expressions)
(A (B) (C) (D))
Into this one:
(C (B) (D))
As long as the parent is A and the second ancestor is C, regardless of context (there may be more parents or ancestors). I'd like to express this transformation in a simple, concise, and re-usable way. Of course this example is very specific. Please try to address the general case.
Edit: RefactoringNG is the kind of thing I'm looking for, although it introduces an entirely new grammar to solve the problem, which I'd like to avoid. I'm still looking for more and/or better examples.
Background:
I'm able to convert Python and Cheetah (don't ask!) files into tokenized tree representations, and in turn convert those into lxml trees. I then plan to reorganize the tree and write out the results in order to implement automated refactoring. XSLT seems to be the standard tool for rewriting XML, but the syntax is terrible (in my opinion, obviously) and nobody at our shop would understand it.
I could write some functions which simply use the lxml methods (.xpath and such) to implement my refactorings, but I'm worried that I would wind up with a bunch of purpose-built spaghetti code that can't be re-used.
Let's try this in Python code. I've used strings for the leaves, but this will work with any objects.
def lift_middle_child(in_tree):
    (A, (B,), (C,), (D,)) = in_tree
    return (C, (B,), (D,))

print(lift_middle_child(('A', ('B',), ('C',), ('D',))))  # could use lists too
This sort of tree transformation is generally better performed in a functional style - if you create a bunch of these functions, you can explicitly compose them, or create a composition function to work with them in a point-free style.
Because you've used s-expressions, I assume you're comfortable representing trees as nested lists (or the equivalent - unless I'm mistaken, lxml nodes are iterable in that way). Obviously, this example relies on a known input structure, but your question implies that. You can write more flexible functions, and still compose them, as long as they have this uniform interface.
Here's the code in action: http://ideone.com/02Uv0i
Now, here's a function to reverse children, and using that and the above function, one to lift and reverse:
from functools import reduce  # reduce moved out of the builtins in Python 3

def compose2(a, b):  # might want to get this from a functional library
    return lambda *x: a(b(*x))

def compose(*funcs):  # compose(a,b,c) = a(b(c(x))) - you might want to reverse that
    return reduce(compose2, funcs)

def reverse_children(in_tree):
    return in_tree[0:1] + in_tree[1:][::-1]  # slightly cryptic, but works for anything subscriptable

lift_and_reverse = compose(reverse_children, lift_middle_child)  # rightmost function applied first - if you find this confusing, reverse the order in the compose function

print(lift_and_reverse(('A', ('B',), ('C',), ('D',))))
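For the reusability you asked about, one option is a small generic bottom-up rewriter over nested tuples. A hedged sketch (rewrite and lift_rule are my own illustrative names, not from any library):
def rewrite(tree, rule):
    # apply `rule` bottom-up to every node; `rule` returns a replacement
    # node, or None to leave the node unchanged
    if not isinstance(tree, tuple):
        return tree
    node = tree[:1] + tuple(rewrite(child, rule) for child in tree[1:])
    replacement = rule(node)
    return node if replacement is None else replacement

def lift_rule(node):
    # matches (A (B) (C) (D)) anywhere in the tree, rewrites it to (C (B) (D))
    if len(node) == 4 and node[0] == 'A':
        _, b, c, d = node
        return (c[0], b, d)
    return None

print(rewrite(('X', ('A', ('B',), ('C',), ('D',))), lift_rule))
# ('X', ('C', ('B',), ('D',)))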
What you really want, IMHO, is a program transformation system, which allows you to parse and transform code using patterns expressed in the surface syntax of the source code (and even the target language) to express the rewrites directly.
You will find that even if you can get your hands on an XML representation of the Python tree, the effort to write an XSLT/XPath transformation is more than you expect; trees representing real code are messier than you'd expect, XSLT isn't that convenient a notation, and it cannot directly express common conditions on trees that you'd like to check (e.g., that two subtrees are the same). A final complication with XML: assume it has been transformed. How do you regenerate the source code syntax it came from? You need some kind of prettyprinter.
A general problem, regardless of how the code is represented, is that without information about scopes and types (where you can get it), writing correct transformations is pretty hard. After all, if you are going to transform Python into a language that uses different operators for string concatenation and arithmetic (unlike Java, which uses "+" for both), you need to be able to decide which operator to generate. So you need type information to decide. Python is arguably typeless, but in practice most expressions involve variables which have only one type for their entire lifetime. So you'll also need flow analysis to compute types.
Our DMS Software Reengineering Toolkit has all of these capabilities (parsing, flow analysis, pattern matching/rewriting, prettyprinting), and robust parsers for many languages including Python. (While it has flow analysis capability instantiated for C, COBOL, Java, this is not instantiated for Python. But then, you said you wanted to do the transformation regardless of context).
To express your rewrite in DMS, on Python syntax, close to your example (which isn't Python?):
domain Python;

rule revise_arguments(f:IDENTIFIER, A:expression, B:expression,
                      C:expression, D:expression): primary -> primary
  = " \f(\A,(\B),(\C),(\D)) "
 -> " \f(\C,(\B),(\D)) ";
The notation above is the DMS rule-rewriting language (RSL). The "..." are metaquotes that separate Python syntax (inside those quotes, DMS knows it is Python because of the domain notation declaration) from the DMS RSL language. The \f, \A, ... inside the metaquotes refer to the syntax-variable placeholders of the named nonterminal types defined in the rule parameter list. Yes, the (...) inside the metaquotes are Python parentheses... they exist in the syntax trees as far as DMS is concerned, because they, like the rest of the language, are just syntax.
The above rule looks a bit odd because I'm trying to follow your example as closely as possible, and from an expression-language point of view, your example is odd precisely because it has unusual parentheses.
With this rule, DMS could parse Python (using its Python parser) like
foobar(2+3,(x-y),(p),(baz()))
build an AST, match the (parsed-to-AST) rule against that AST, rewrite it to another AST corresponding to:
foobar(p,(x-y),(baz()))
and then prettyprint the surface syntax (valid) python back out.
If you intended your example to be a transformation on LISP code, you'd need a LISP grammar for DMS (not hard to build, but we don't have much call for this), and write the corresponding surface syntax:
domain Lisp;

rule revise_form(A:form, B:form, C:form, D:form): form -> form
  = " (\A,(\B),(\C),(\D)) "
 -> " (\C,(\B),(\D)) ";
You can get a better feel for this by looking at Algebra as a DMS domain.
If your goal is to implement all this in Python... I don't have much help.
DMS is a pretty big system, and it would be a lot of effort to replicate.

Which programming language has very short context-free Grammar in its formal specification?

Which programming languages have short and beautiful grammars (in EBNF)?
Some languages are easier to parse than others. Some time ago I created a simple VHDL parser, but it was very slow; not because it was implemented entirely in Python, but because the VHDL grammar (in EBNF) is huge. The EBNF of Python is beautiful, but it is not very short.
I suspect that many functional programming languages, like LISP, have short, simple grammars, but I am interested in a more popular, simple imperative language like C or Bash.
I haven't compared, but Lua is a language renowned for its simple syntax. The BNF is at the very end of its reference manual: http://www.lua.org/manual/5.1/manual.html
Assembly languages!
...in general, and particularly for CPUs with a simple architecture (few instructions, few addressing modes, few registers), have relatively short grammars.
In fact, specialized processors, such as those found in programmable logic controllers, can have languages with even simpler grammars. But then again, the simplest PLCs are little more than Boolean equation calculators.
One of the simplest imperative languages is Oberon-2. Syntax of Oberon-2.
Also take a look at Oberon-07 (The Programming Language Oberon-07, PDF) and Component Pascal.
Pascal has only 2-3 pages of BNF notation.
What about GL Shading language? Language Specification (PDF)
However, for this kind of hobby project I have always preferred to implement a subset of a known language by myself, rather than choosing anything "premade".
Lisp is probably pretty small.
lisp ::= '(' exp ')'
