I have been reverse engineering a specific black box equation that is part of a system I do not own (do not worry, it's white-hat), in which you can only measure the inputs (a large set of integers) and outputs (two integers).
This system can only be perfectly described as a program/function in which all the input integers are used, and so far I can perfectly describe the behavior by creating a data structure that has named "mathematical terms" in which each named input integer lives, and each term has an ordering for the inputs that it owns. I also have a function that takes the model description, and a set of named inputs, and outputs two integers. So the mapping of lists of input names to program behavior lives in here and in the model description in tandem.
I've been programming the reverse engineering utility in Python, but ultimately I want to output a low-level Lua program that represents this function in a less abstract manner. When there were fewer terms in the model, it was simple to manually write a "transpiler" from this model (in Python) to Lua, but as the complexity grows it's painful to rewrite the code generator for new types of terms, especially in an ad-hoc manner.
From reading other questions about similar systems, it seems the very last two steps of this process would be: generating an abstract syntax tree representing my desired program, and giving the AST to a Lua prettyprinter to generate the code. But I'm not sure whether there are useful abstractions I'm unaware of that would help me generate a Lua AST, given my current description of the model.
What you're looking for is an abstract syntax tree (AST), which represents the behavior of a program as a tree of nodes. Since each component of an abstract syntax tree is highly compartmentalized (e.g., "Add", "Number Constant"...), it is straightforward to translate an abstract syntax tree back into a high-level programming language, such as Lua.
Abstract syntax trees are used in many compilers and transpilers, so you will not be digging long to find good examples.
CSharp.Lua does a similar thing to what you want; transpiling C# to Lua using a simple abstract syntax tree and a slightly less simple code generator.
Speedy Web Compiler contains an excellent implementation of a JavaScript code generator.
ESBuild also has a well-done implementation for JavaScript.
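To make the "AST plus code generator" idea concrete, here is a minimal sketch in Python. The node classes (`Num`, `Var`, `BinOp`) and the functions `emit_expr` and `emit_function` are hypothetical names invented for this illustration; they are not taken from the question's model or from any of the projects above. The sketch assumes each output of the model can be written as an arithmetic expression over named inputs.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Num:
    value: int


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str  # a Lua binary operator such as "+", "-" or "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]


def emit_expr(node: Expr) -> str:
    """Render one expression node as Lua source text."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, Var):
        return node.name
    if isinstance(node, BinOp):
        # Always parenthesize, so operator precedence never has to be tracked.
        return f"({emit_expr(node.left)} {node.op} {emit_expr(node.right)})"
    raise TypeError(f"unknown node type: {type(node).__name__}")


def emit_function(name: str, params: List[str], results: List[Expr]) -> str:
    """Render a Lua function that returns one value per result expression."""
    returns = ", ".join(emit_expr(result) for result in results)
    return (
        f"local function {name}({', '.join(params)})\n"
        f"  return {returns}\n"
        f"end"
    )


lua_source = emit_function(
    "model",
    ["a", "b", "c"],
    [
        BinOp("+", Var("a"), BinOp("*", Var("b"), Num(2))),
        BinOp("-", Var("c"), Var("a")),
    ],
)
print(lua_source)
```

With this split, supporting a new kind of term means adding one node class and one branch in `emit_expr`, rather than rewriting the generator.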
Basically, I am trying to learn some very basic data structures and algorithms using Python. However, I think that when implementing these algorithms, I unknowingly start using Python tricks a bit, even ones as simple as the following, which would not be considered tricks by any stretch of the imagination:
for i, item in enumerate(arr):
    # Algo implementation
or
if item in items:
    # do something
I don't know what general guideline to follow so that I can grasp the algorithm as it's meant to be implemented.
It is all right to use Python's techniques to solve problems. The main exception is when Python does something for you and you want to learn how that something was done. One example is Python's heapq--You can't use that directly if your purpose is to understand how the binary heap structure can be used to implement a priority queue. Of course, you could read the source code and learn much that way.
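As an illustration of what heapq does for you, here is a minimal sketch of a binary min-heap written by hand. The class name `MinHeap` and its methods are made up for this example; this is not how heapq itself is written, only the same data structure in its simplest form.

```python
class MinHeap:
    """A binary min-heap in a list: the children of index i are at 2*i + 1 and 2*i + 2."""

    def __init__(self):
        self._items = []

    def __len__(self):
        return len(self._items)

    def push(self, value):
        # Append at the bottom, then sift up while smaller than the parent.
        items = self._items
        items.append(value)
        child = len(items) - 1
        while child > 0:
            parent = (child - 1) // 2
            if items[parent] <= items[child]:
                break
            items[parent], items[child] = items[child], items[parent]
            child = parent

    def pop(self):
        # Remove the root, move the last item to the top, then sift it down.
        items = self._items
        if not items:
            raise IndexError("pop from empty heap")
        smallest = items[0]
        last = items.pop()
        if items:
            items[0] = last
            parent = 0
            while True:
                left, right = 2 * parent + 1, 2 * parent + 2
                candidate = parent
                if left < len(items) and items[left] < items[candidate]:
                    candidate = left
                if right < len(items) and items[right] < items[candidate]:
                    candidate = right
                if candidate == parent:
                    break
                items[parent], items[candidate] = items[candidate], items[parent]
                parent = candidate
        return smallest


heap = MinHeap()
for priority in [5, 1, 4, 2, 3]:
    heap.push(priority)
drained = [heap.pop() for _ in range(len(heap))]
print(drained)  # [1, 2, 3, 4, 5]
```

Writing the sift-up and sift-down loops yourself is exactly the part that calling heapq.heappush and heapq.heappop would hide.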
One thing that can help you is to read a data structures and algorithms book that is based on Python. Then you can be assured that Python will not be used to slide over important topics--at least, if the book is any good.
One such book is Problem Solving with Algorithms and Data Structures. Another is Classic Computer Science Problems in Python. The first book is a free PDF download, though I believe there is a more recent edition that is not free. The second is not free but you can get a discount of 40% at the publisher's web site if you use a discount code mentioned in the Talk Python to Me podcast. I am working through the latter book now, as a reminder of the class I took a very long time ago.
As to a recommendation, the price may be a deciding factor for you. The emphases of the two books also differ. The first is older, using more generic Python and not using many of Python's special features. It is also closer to a textbook, going into more depth in its topics. It covers things like execution complexity, for example. The PDF version, however, does not cover as many topics as other versions I have seen. The PDF does not cover graphs (networks), for example, which another version (which I cannot find now) does.
The second is much more recent, using features of Python 3.7 such as type hints. It also is more of an introduction or review. I think I can use "fair use" to quote the relevant section of the book:
Who is this book for?
This book is for both intermediate and experienced programmers.
Experienced programmers who want to deepen their knowledge of Python
will find comfortably familiar problems from their computer science or
programming education. Intermediate programmers will be introduced to
these classic problems in the language of their choice: Python.
Developers getting ready for coding interviews will likely find this
book to be valuable preparation material.
In addition to professional programmers, students enrolled in
undergraduate computer science programs who have an interest in Python
will likely find this book helpful. It makes no attempt to be a
rigorous introduction to data structures and algorithms. This is not a
data structures and algorithms textbook. You will not find proofs or
extensive use of big-O notation within its pages. Instead, it is
positioned as an approachable, hands-on tutorial to the
problem-solving techniques that should be the end product of taking
data structure, algorithm, and artificial intelligence classes.
Once again, knowledge of Python’s syntax and semantics is assumed. A
reader with zero programming experience will get little out of this
book, and a programmer with zero Python experience will almost
certainly struggle. In other words, Classic Computer Science Problems
in Python is a book for working Python programmers and computer
science students.
If you want to understand how an algorithm works, I would strongly recommend working with flowcharts. They represent the algorithmic procedure as relations between the elementary logical statements the algorithm is made of, and they are independent of the programming language in which an algorithm might be implemented.
If you want to learn python along with it, then here is what you can do:
1. Study the flowchart of the algorithm that interests you.
2. Translate that flowchart 1-to-1 into python code.
3. Have a closer look at your python code and try to optimize or compact it.
This can be best illustrated with an example:
1.
Here is the flowchart of Euclid's algorithm, which finds the greatest common divisor of two numbers (taken from the wiki page Algorithm):
To understand an algorithm means to be able to follow or even reproduce this flowchart.
2.
Now if your goal is to learn python a great exercise is to take a flowchart and translate it to python. No shortcuts, no simplifications, just 1-to-1 as it is written, translate the algorithm to python. You won't be fooled by any tricks or masked complexity when doing so, as the flowchart tells you the elementary logical steps and you are just translating them to your preferred programming language.
For the example above, a crude 1-to-1 implementation looks like this:
def gcd(a,b): # point 1
    while True:
        if b == 0: # p. 2
            return a # p. 8 + 9
        if a > b: # p. 3
            a = a - b # p. 6
            # p. 7
        else: # p. 3
            b = b - a # p. 4
            # p. 5
3.
By now you have both learned how the algorithm works and how you implement logical statements in python. The tricks you mentioned earlier can enter the game here. You can start to play around and try to make the implementation more efficient, more compact or a one-liner (people like this for some reason). This will not only help your logical understanding but it will also deepen your knowledge of the programming language you are using.
As for the example at hand, Euclid's algorithm, there is not a lot of fancy business that comes to my mind. I somehow find recursive calls elegant, so here is a tricky implementation using this:
def gcd(a,b):
    if b == 0:
        return a
    else:
        return gcd(a-b,b) if a > b else gcd(a, b - a)
Note that you can (and sometimes even have to) do this procedure in the reverse order. It can happen that the only thing you know about an algorithm is an implementation of it. Then you would proceed in exactly the reverse order: 3.->2. Try to identify and 'expand' all trickery that might be present in the implementation. 2.->1. Use the 'expanded' implementation to create a flowchart of the algorithm, in order to have a proper definition.
They are not tricks!
These are the same things you would do in any other language. They are just made simpler in Python.
In c/c++ you would do,
for (int i = 0; i < sizeof(arr)/sizeof(arr[0]); i++) {
    // access the array elements here as arr[i]
}
You would do the same thing in Python in a slightly more convenient way, i.e.
for i, a in enumerate(arr):
    # do something
or
for i in range(len(arr)):
    # do something with arr elements
Your algorithms will NOT depend upon these syntactical differences.
Whether it is in Python or in C/C++ or in any other language, if you have a good understanding of the language, you are good to go with anything. You just have to keep in mind the time complexities of the algorithms you use and how you implement them.
The thing with Python is that it's much easier to understand, it's shorter to write, it has a lot of built-in functions, you need no class or main function to execute your program, and much more.
If you ask me, I would not call them tricks. All programming languages have these things in common, with only syntactical differences.
It depends on what you are trying to implement. For example, if you are trying to implement a linked list, you just need to know what you can use in Python to implement it.
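For the linked list case, a minimal sketch could look like the following. The names `Node`, `LinkedList`, `push_front` and `reverse` are chosen for this example only. The algorithmic part (pointer manipulation) is written out by hand, while Python conveniences such as `__iter__` are used only for the plumbing around it.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # O(1): the new node simply points at the old head.
        self.head = Node(value, self.head)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next

    def reverse(self):
        # Classic in-place reversal: walk the list once, flipping each link.
        previous = None
        node = self.head
        while node is not None:
            next_node = node.next
            node.next = previous
            previous = node
            node = next_node
        self.head = previous


numbers = LinkedList()
for value in [3, 2, 1]:
    numbers.push_front(value)
before = list(numbers)
numbers.reverse()
after = list(numbers)
print(before, after)  # [1, 2, 3] [3, 2, 1]
```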
I am researching how to implement a DSL in Python, and I am looking for a small DSL that is friendly for someone who has no experience with designing and implementing languages. So far, I have reviewed two implementations, Hy and Mochi. Hy is actually a dialect of Lisp, and Mochi seems to be very similar to Elixir. Both are too complex for me right now, as my aim is to prototype the language and play around with it in order to find out whether it really helps in solving the problem and fits the style that the problem requires. I am aware that Python has good support via the language tools provided in the standard library. So far I have implemented a very simple dialect of Lisp; I did not use the Python AST at all, and it is implemented purely via string processing, which is not flexible enough for what I am looking for.
Are there any implementations other than the two languages mentioned above that are small enough to be studied?
What are some good books on this subject (practical, in the sense that they do not stick only to the theoretical and academic aspects)?
What would be a good way into studying Python AST and using it ?
Are there any significant performance issues related to languages built upon Python ( like Hy ) in terms of being overhead on the actual produced bytecode ?
Thanks
You can split the task of creating (yet another!) new language into at least two big steps:
Syntax
Semantics & Interpretation
Syntax
You need to define a grammar for your language, with production rules that specify how to create complex expressions from simple ones.
Example: syntax for LISP:
expression ::= atom | list
atom ::= number | symbol
number ::= [+-]?['0'-'9']+
symbol ::= ['A'-'Z''a'-'z'].*
list ::= '(' expression* ')'
How to read it: an expression is either an atom or a list; an atom is a number or a symbol; a number is... and so on.
Often you will also define some tokenization rules, because most grammars work at the token level, not at the character level.
Once you have defined your grammar, you want a parser that, given a sentence (a program), is able to build the derivation tree, or the abstract syntax tree.
For example, for the expression x=f(y+1)+2, you want to obtain the tree:
There are several parsers (LL, LR, recursive descent, ...). You don't necessarily need to write your language parser by yourself, as there are tools that generate the parser from the grammar specification (LEX & YACC, Flex & Bison, JavaCC, ANTLR; also check this list of parsers available for Python).
If you want to skip the step of designing a new grammar, you may want to start from a simple one, like the grammar of LISP. There is even a LISP parser written in Python in the Pyperplan project. They use it for parsing PDDL, which is a domain specific language for planning that is based on LISP.
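As a rough sketch of how small a hand-written parser for the LISP grammar above can be, here is a tokenizer plus a recursive descent parser in Python. The function names are invented for this example and it is not the Pyperplan parser. It also relaxes the grammar in one place, as an assumption made for brevity: any token that is not a number is accepted as a symbol, so `+` counts as a symbol even though the `symbol` rule above requires a leading letter.

```python
import re


def tokenize(source):
    """Split source text into '(' and ')' tokens and atom tokens."""
    return re.findall(r"[()]|[^\s()]+", source)


def parse_atom(token):
    """atom ::= number | symbol. Numbers become ints, symbols stay strings."""
    if re.fullmatch(r"[+-]?[0-9]+", token):
        return int(token)
    return token


def parse_expression(tokens, position=0):
    """expression ::= atom | list. Returns (tree, next_position)."""
    if position >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[position]
    if token == "(":
        # list ::= '(' expression* ')'
        items = []
        position += 1
        while position < len(tokens) and tokens[position] != ")":
            item, position = parse_expression(tokens, position)
            items.append(item)
        if position >= len(tokens):
            raise SyntaxError("missing ')'")
        return items, position + 1
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return parse_atom(token), position + 1


def parse(source):
    tokens = tokenize(source)
    tree, position = parse_expression(tokens)
    if position != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return tree


tree = parse("(define (inc x) (+ x 1))")
print(tree)  # ['define', ['inc', 'x'], ['+', 'x', 1]]
```

Each grammar rule maps to one function, which is the defining property of recursive descent; the nested Python lists play the role of the abstract syntax tree.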
Useful readings:
Book: Compilers: Principles, Techniques, and Tools by Alfred Aho, Jeffrey Ullman, Monica S. Lam, and Ravi Sethi, also known as The Dragon Book (because of the dragon pictured on the cover)
https://en.wikipedia.org/?title=Syntax_(programming_languages)
https://en.wikibooks.org/wiki/Introduction_to_Programming_Languages/Grammars
How to define a grammar for a programming language
https://en.wikipedia.org/wiki/LL_parser
https://en.wikipedia.org/wiki/Recursive_descent_parser
https://en.wikipedia.org/wiki/LALR_parser
https://en.wikipedia.org/wiki/LR_parser
Semantics & Interpretation
Once you have the abstract syntax tree of your program, you want to execute your program. There are several formalisms for specifying the "rules" to execute (pieces of) programs:
Operational semantics: a very popular one. It is classified in two categories:
Small Step Semantics: describe individual steps of computation
Big Step Semantics: describe the overall results of computation
Reduction semantics: a formalism based on lambda calculus
Transition semantics: if you look at your interpreter like a transition system, you can specify its semantics using transition semantics. This is especially useful for programs that do not terminate (i.e. run continuously), like controllers.
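To give a feel for the big-step style, here is a minimal sketch of an evaluator that maps a whole expression tree directly to its final value. The tree format (nested lists with an operator name first), the `if` form that treats 0 as false, and the name `evaluate` are all assumptions made for this example, not part of any formalism cited here.

```python
import operator
from functools import reduce

OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, env):
    """Big-step evaluation: return the final value of the whole expression."""
    if isinstance(tree, int):
        return tree  # a number evaluates to itself
    if isinstance(tree, str):
        return env[tree]  # a symbol evaluates to its binding
    head, *args = tree
    if head == "if":
        # Only the chosen branch is evaluated.
        condition, then_branch, else_branch = args
        chosen = then_branch if evaluate(condition, env) != 0 else else_branch
        return evaluate(chosen, env)
    # Operator application: evaluate every argument, then fold left to right.
    values = [evaluate(arg, env) for arg in args]
    return reduce(OPERATORS[head], values)


total = evaluate(["+", "x", ["*", 2, 3]], {"x": 4})
branch = evaluate(["if", ["-", "x", 4], 1, 0], {"x": 4})
print(total, branch)  # 10 0
```

A small-step version would instead rewrite the tree one reduction at a time and return the partially reduced tree after each step.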
Useful readings:
Book: A Structural Approach to Operational Semantics [pdf link] by Gordon D. Plotkin
Book: Structure and Interpretation of Computer Programs by Gerald Jay Sussman and Hal Abelson
Book: Semantics Engineering with PLT Redex (SEwPR) by Matthias Felleisen, Robert Bruce Findler, and Matthew Flatt
https://en.wikipedia.org/wiki/Semantics_(computer_science)
https://en.wikipedia.org/wiki/Operational_semantics
https://en.wikipedia.org/wiki/Transition_system
https://en.wikipedia.org/wiki/Kripke_structure_(model_checking)
https://en.wikipedia.org/wiki/Hoare_logic
https://en.wikipedia.org/wiki/Lambda_calculus
You don't really need to know a lot about parsing to write your own language.
I wrote a library that lets you do just that very easily: https://github.com/erezsh/lark
Here's a blog post by me explaining how to use it to write your own language: http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/
I hope you don't mind my shameless plug, but it seems very relevant to your question.
I'd like to be able to express a general transformation of one tree into another without writing a bunch of repetitive spaghetti code. Are there any libraries to help with this problem? My target language is Python, but I'll look at other languages as long as it's feasible to port to Python.
Example: I'd like to transform this node tree: (please excuse the S-expressions)
(A (B) (C) (D))
Into this one:
(C (B) (D))
As long as the parent is A and the second ancestor is C, regardless of context (there may be more parents or ancestors). I'd like to express this transformation in a simple, concise, and re-usable way. Of course this example is very specific. Please try to address the general case.
Edit: RefactoringNG is the kind of thing I'm looking for, although it introduces an entirely new grammar to solve the problem, which i'd like to avoid. I'm still looking for more and/or better examples.
Background:
I'm able to convert python and cheetah (don't ask!) files into tokenized tree representations, and in turn convert those into lxml trees. I plan to then re-organize the tree and write-out the results in order to implement automated refactoring. XSLT seems to be the standard tool to rewrite XML, but the syntax is terrible (in my opinion, obviously) and nobody at our shop would understand it.
I could write some functions which simply use the lxml methods (.xpath and such) to implement my refactorings, but I'm worried that I will wind up with a bunch of purpose-built spaghetti code which can't be re-used.
Let's try this in Python code. I've used strings for the leaves, but this will work with any objects.
def lift_middle_child(in_tree):
    (A, (B,), (C,), (D,)) = in_tree
    return (C, (B,), (D,))

print(lift_middle_child(('A', ('B',), ('C',), ('D',))))  # could use lists too
This sort of tree transformation is generally better performed in a functional style - if you create a bunch of these functions, you can explicitly compose them, or create a composition function to work with them in a point-free style.
Because you've used s-expressions, I assume you're comfortable representing trees as nested lists (or the equivalent - unless I'm mistaken, lxml nodes are iterable in that way). Obviously, this example relies on a known input structure, but your question implies that. You can write more flexible functions, and still compose them, as long as they have this uniform interface.
Here's the code in action: http://ideone.com/02Uv0i
Now, here's a function to reverse children, and using that and the above function, one to lift and reverse:
from functools import reduce  # reduce is not a builtin in Python 3

def compose2(a,b): # might want to get this from the functional library
    return lambda *x: a(b(*x))

def compose(*funcs): #compose(a,b,c) = a(b(c(x))) - you might want to reverse that
    return reduce(compose2,funcs)

def reverse_children(in_tree):
    return in_tree[0:1] + in_tree[1:][::-1] # slightly cryptic, but works for anything subscriptable

lift_and_reverse = compose(reverse_children,lift_middle_child) # right most function applied first - if you find this confusing, reverse order in compose function.

print(lift_and_reverse(('A', ('B',), ('C',), ('D',))))
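The functions above assume the whole input has a known shape. One possible way to handle the "regardless of context" part of the question is a small driver that applies a rule to every node, bottom-up, where the rule returns its input unchanged when the pattern does not match. The following is only a sketch under that assumption; `rewrite_everywhere` and `lift_c_over_a` are hypothetical names, and trees are nested tuples whose first element is the node label.

```python
def rewrite_everywhere(rule, tree):
    """Apply rule bottom-up to every node of a nested-tuple tree."""
    label, *children = tree
    rebuilt = (label, *(rewrite_everywhere(rule, child) for child in children))
    return rule(rebuilt)


def lift_c_over_a(tree):
    """(A (X) (C) (Y)) -> (C (X) (Y)); any other shape is returned unchanged."""
    if len(tree) == 4 and tree[0] == "A" and tree[2] == ("C",):
        _, first, _, last = tree
        return ("C", first, last)
    return tree


document = (
    "root",
    ("A", ("B",), ("C",), ("D",)),   # matches the pattern
    ("A", ("X",), ("Y",), ("Z",)),   # middle child is not C, left alone
)
result = rewrite_everywhere(lift_c_over_a, document)
print(result)
```

Because every rule has the same tree-in, tree-out interface, rules written this way can still be combined with a composition helper before being handed to the driver.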
What you really want, IMHO, is a program transformation system, which allows you to parse and transform code using patterns expressed in the surface syntax of the source code (and even the target language) to express the rewrites directly.
You will find that even if you can get your hands on an XML representation of the Python tree, the effort to write an XSLT/XPath transformation is more than you expect; trees representing real code are messier than you'd expect, XSLT isn't that convenient a notation, and it cannot directly express common conditions on trees that you'd like to check (e.g., that two subtrees are the same). A final complication with XML: assume it has been transformed. How do you regenerate the source code syntax from which it came? You need some kind of prettyprinter.
A general problem regardless of how the code is represented is that without information about scopes and types (where you can get it), writing correct transformations is pretty hard. After all, if you are going to transform python into a language that uses different operators for string concat and arithmetic (unlike Java which uses "+" for both), you need to be able to decide which operator to generate. So you need type information to decide. Python is arguably typeless, but in practice most expressions involve variables which have only one type for their entire lifetime. So you'll also need flow analysis to compute types.
Our DMS Software Reengineering Toolkit has all of these capabilities (parsing, flow analysis, pattern matching/rewriting, prettyprinting), and robust parsers for many languages including Python. (While it has flow analysis capability instantiated for C, COBOL, Java, this is not instantiated for Python. But then, you said you wanted to do the transformation regardless of context).
To express your rewrite in DMS on Python syntax close to your example (which isn't Python?), you would write:
domain Python;
rule revise_arguments(f:IDENTIFIER,A:expression,B:expression,
C:expression,D:expression):primary->primary
= " \f(\A,(\B),(\C),(\D)) "
-> " \f(\C,(\B),(\D)) ";
The notation above is the DMS rule-rewriting language (RSL). The "..." are metaquotes that separate Python syntax (inside those quotes, DMS knows it is Python because of the domain notation declaration) from the DMS RSL language. The \n inside the meta quote refers to the syntax variable placeholders of the named nonterminal type defined in the rule parameter list. Yes, (...) inside the metaquotes are Python ( ) ... they exist in the syntax trees as far as DMS is concerned, because they, like the rest of the language, are just syntax.
The above rule looks a bit odd because I'm trying to follow your example as closely as possible, and from an expression language point of view, your example is odd precisely because it does have unusual parentheses.
With this rule, DMS could parse Python (using its Python parser) like
foobar(2+3,(x-y),(p),(baz()))
build an AST, match the (parsed-to-AST) rule against that AST, rewrite it to another AST corresponding to:
foobar(p,(x-y),(baz()))
and then prettyprint the surface syntax (valid) python back out.
If you intended your example to be a transformation on LISP code, you'd
need a LISP grammar for DMS (not hard to build, but we don't have much
call for this), and write corresponding surface syntax:
domain Lisp;
rule revise_form(A:form,B:form, C:form, D:form):form->form
= " (\A,(\B),(\C),(\D)) "
-> " (\C,(\B),(\D)) ";
You can get a better feel for this by looking at Algebra as a DMS domain.
If your goal is to implement all this in Python... I don't have much help.
DMS is a pretty big system, and it would be a lot of effort to replicate.
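For comparison only, a far more limited version of the same Python-on-Python rewrite can be sketched with the standard library's ast module. This is not DMS and offers none of its surface-syntax patterns or flow analysis. The class name `ReviseArguments` is made up for the example, and `ast.unparse` needs Python 3.9 or later. Python's AST also does not record redundant parentheses, comments or layout, so the regenerated source is normalized rather than a faithful prettyprint.

```python
import ast


class ReviseArguments(ast.NodeTransformer):
    """Rewrite f(A, B, C, D) into f(C, B, D) where f is a plain identifier."""

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        is_plain_name = isinstance(node.func, ast.Name)
        if is_plain_name and len(node.args) == 4 and not node.keywords:
            _, second, third, fourth = node.args
            node.args = [third, second, fourth]
        return node


source = "foobar(2+3,(x-y),(p),(baz()))"
tree = ReviseArguments().visit(ast.parse(source))
rewritten = ast.unparse(ast.fix_missing_locations(tree))
print(rewritten)  # foobar(p, x - y, baz())
```

The pattern here is hard-coded in Python rather than expressed in the surface syntax of the language being transformed, which is the main convenience a dedicated rewriting system adds.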
What programming language has short and beautiful grammars (in EBNF)?
Some languages are easier to parse than others. Some time ago I created a simple VHDL parser, but it was very slow. Not because it is implemented completely in Python, but because the VHDL grammar (in EBNF) is huge. The EBNF of Python is beautiful, but it is not very short.
I suppose that many functional programming languages like LISP have short, simple grammars, but I am interested in a more popular, simple imperative language like C or Bash.
I haven't compared, but Lua is a language renowned for its simple syntax. The BNF is at the very end of this reference manual: http://www.lua.org/manual/5.1/manual.html .
Assembly languages!
...in general, and particularly for CPUs which have a simple architecture (few instructions, few addressing modes, few registers) have a relatively short grammar.
In fact, specialized processors, such as those found in programmable logic controllers, can have a language with an even simpler grammar. But then again, the simplest PLCs are little more than Boolean equation calculators.
One of the simplest imperative languages is Oberon-2. Syntax of Oberon-2.
Also take a look at Oberon-07 (The Programming Language Oberon-07, PDF) and Component Pascal.
Pascal has only 2-3 pages of BNF notation.
What about the GL Shading Language? Language Specification (PDF)
However, for this kind of hobby I have always preferred to implement a subset of a known language myself, without choosing anything "premade".
Lisp is probably pretty small.
lisp ::= `(´ exp `)´