How to save an AST generated by ANTLR - python

I have successfully generated an AST using ANTLR in Python, but I cannot figure out for the life of me how to save it for later use. The only option I have found is the tree.toStringTree() method, but its output is messy and not particularly convenient or easy to work with.
How do I save the tree, and what format would be easiest to work with, visualise, and load back in later?
EDIT: I can see in the Java documentation there is a DotGenerator() for producing a DOT file of the tree, but I can't find a way to do anything like this in Python.

What you are looking for is a serializer/deserializer of the parse tree. Serialization was previously addressed on StackOverflow here. It isn't supported in the runtime (AFAIK) because it is rarely useful: one can reconstruct the tree very quickly by re-parsing the input. Even if you want to change the tree with a transformation, you can replace nodes with sub-trees whose node types don't even exist in your parser, print out the tree, then re-parse to reconstruct it with the parse types for your grammar. Serializing only makes sense if parsing with semantic analysis is very slow, so you should consider the problem carefully.
However, it's not difficult to write a crude serializer/deserializer that does not consider "off-channel" content such as whitespace or comments. This C# program (which you could adapt to Python) is an example that reconstructs the tree using the grammars-v4/sexpression.g4 grammar for a target grammar arithmetic.g4. Using toStringTree(rule-names), the tree is first serialized into a string. (Note: toStringTree() without the parser rule names is difficult to read, which is why I asked.) The s-expression is then parsed and a bottom-up reconstruction is performed using an ANTLR visitor. Since toStringTree() does not mark parse-tree leaves with the token type (e.g., to distinguish a number from a symbol), the string is lexed to reconstruct each value. The program also uses reflection to create the correct parse-tree node type.
Outputting a Dot graph for the parse tree is also easy; I included that in the program as well, using a top-down recursive visitor. The recursive function outputs an edge from each node to each of its children. Since each node name has to be unique (it's a tree), I append the node's pre-order number to its name.
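For reference, here is a minimal Python sketch of the same two ideas (a string dump plus DOT output) using the antlr4-python3-runtime. The arithmeticLexer/arithmeticParser names and the expression start rule are placeholders for whatever ANTLR generated from your grammar; only the generic ParseTree API is assumed.

# Minimal sketch (Python 3, antlr4-python3-runtime). "arithmeticLexer" and
# "arithmeticParser" are placeholders for your generated lexer/parser.
from antlr4 import CommonTokenStream, InputStream
from antlr4.tree.Trees import Trees

def to_dot(tree, rule_names):
    """Emit a Graphviz DOT description of a parse tree (pre-order numbering)."""
    lines = ["digraph parse_tree {"]
    counter = [0]

    def walk(node):
        my_id = counter[0]
        counter[0] += 1
        label = Trees.getNodeText(node, rule_names).replace('"', '\\"')
        lines.append(f'  n{my_id} [label="{label}"];')
        for i in range(node.getChildCount()):
            child_id = counter[0]       # pre-order number the child will get
            walk(node.getChild(i))
            lines.append(f"  n{my_id} -> n{child_id};")

    walk(tree)
    lines.append("}")
    return "\n".join(lines)

# Hypothetical usage with a generated parser:
# lexer = arithmeticLexer(InputStream("1 + 2 * 3"))
# parser = arithmeticParser(CommonTokenStream(lexer))
# tree = parser.expression()                      # your start rule
# print(Trees.toStringTree(tree, None, parser))   # s-expression style dump
# with open("tree.dot", "w") as f:
#     f.write(to_dot(tree, parser.ruleNames))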
--Ken

Related

DecisionTreeClassificationModel - how to parse and visualize decision tree in PySpark?

I have a model fitted by DecisionTreeClassifier (class DecisionTreeClassificationModel) and need to parse its tree nodes in order to visualize a subset of the tree or the whole tree, but the methods available in the PySpark API seem very limited.
For example, I'd like to take node N and get its parent or all of its leaves.
Would this be possible using the PySpark API? So far, all I can do is use:
model.toDebugString
and parse the string to recreate the tree structure.
I saw that the Java API provides more options, but I don't know how to use it from a PySpark script.
What I also found on the web is a spark-tree-plotting package that even visualizes the tree, but I got some failures when trying to install it (it seems it is not maintained anymore).
I would appreciate any tips on how to efficiently parse the decision tree returned by the model.
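For illustration, reaching through to the underlying Java objects might look roughly like this (an untested sketch: _call_java and the org.apache.spark.ml.tree node classes are internal, not a stable public API, so the exact names may differ across Spark versions).

def dump_tree(java_node, depth=0):
    """Recursively print the underlying Java decision-tree nodes via py4j."""
    indent = "  " * depth
    cls = java_node.getClass().getSimpleName()
    if cls == "InternalNode":
        split = java_node.split()
        print("%sif feature %d ... (impurity=%.4f)"
              % (indent, split.featureIndex(), java_node.impurity()))
        dump_tree(java_node.leftChild(), depth + 1)
        dump_tree(java_node.rightChild(), depth + 1)
    else:  # LeafNode
        print("%spredict %s" % (indent, java_node.prediction()))

# root = model._call_java("rootNode")   # model: DecisionTreeClassificationModel
# dump_tree(root)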

Save objects in python

I'm programming an animal guessing game in Python, as a binary tree with animals as leaves and discriminatory questions as intermediate nodes. Leaves and questions are objects. Now I want to be able to save the animals and the intermediate questions to a pickle file.
But I do not know how I can identify the various objects for pickling. Normally you would create an object like so: monkey = Animal('Is it a monkey?') so that you could refer to the object by the name monkey.
But as the tree grows the leaf-object monkey is changed into an intermediate node with question 'Does it like peanuts' with a yes-exit to a new monkey-node, and a no-exit to another (new) animal. So, how do I pickle these objects?
I would write the tree out with a pre-order traversal, starting at the root node and working down.
Then, when you want to read the file, you can use the same type of traversal to read your tree back to your program.
All of your nodes can be reached from the root node, so these types of traversals are really handy for easy writing and reading of binary search trees. Last year in my Data Structures course I completed a very similar assignment using this method.
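As a minimal sketch of that idea (Node here is a hypothetical stand-in for your animal/question classes, with yes/no children that are None at the leaves):

import pickle

class Node:
    def __init__(self, text, yes=None, no=None):
        # text is the question, or the animal name at a leaf
        self.text, self.yes, self.no = text, yes, no

def to_preorder(node):
    """Flatten the tree into a pre-order list, with None marking missing children."""
    if node is None:
        return [None]
    return [node.text] + to_preorder(node.yes) + to_preorder(node.no)

def from_preorder(items):
    """Rebuild the tree from the pre-order list produced above."""
    it = iter(items)
    def build():
        text = next(it)
        if text is None:
            return None
        return Node(text, build(), build())
    return build()

# Saving and loading:
# with open("animals.pickle", "wb") as f:
#     pickle.dump(to_preorder(root), f)
# with open("animals.pickle", "rb") as f:
#     root = from_preorder(pickle.load(f))
#
# In practice, pickle.dump(root, f) alone also works, because pickle follows
# object references and serializes the whole tree from the root.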

SyntaxNet creating tree to root verb

I am new to Python and the world of NLP. The recent announcement of Google's SyntaxNet intrigued me. However, I am having a lot of trouble understanding the documentation around both SyntaxNet and related tools (NLTK, etc.).
My goal: given an input such as "Wilbur kicked the ball", I would like to extract the root verb (kicked) and the object it pertains to ("the ball").
I stumbled across spacy.io, and this visualization seems to encapsulate what I am trying to accomplish: POS tag a string and load it into some sort of tree structure so that I can start at the root verb and traverse the sentence.
I played around with syntaxnet/demo.sh and, as suggested in this thread, commented out the last couple of lines to get CoNLL output.
I then loaded this output in a Python script (kludged together myself, probably not correct):
import nltk
from nltk.corpus import ConllCorpusReader
columntypes = ['ignore', 'words', 'ignore', 'ignore', 'pos']
corp = ConllCorpusReader('/Users/dgourlay/development/nlp','input.conll', columntypes)
I see that I have access to corp.tagged_words(), but no relationship between the words. Now I am stuck! How can I load this corpus into a tree type structure?
Any help is much appreciated!
This may have been better as a comment, but I don't yet have the required reputation.
I haven't used the ConllCorpusReader before (would you consider uploading the file you are loading to a gist and providing a link? It would be much easier to test), but I wrote a blog post which may help with the tree parsing aspect: here.
In particular, you probably want to chunk each sentence. Chapter 7 of the NLTK book has some more information on this, but this is the example from my blog:
# This grammar is described in the paper by S. N. Kim, T. Baldwin, and
# M.-Y. Kan, "Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction", Technical report, University of Melbourne, 2010.
grammar = r"""
    NBAR:
        # Nouns and Adjectives, terminated with Nouns
        {<NN.*|JJ>*<NN.*>}

    NP:
        {<NBAR>}
        # Above, connected with in/of/etc...
        {<NBAR><IN><NBAR>}
"""
chunker = nltk.RegexpParser(grammar)
# postoks is a POS-tagged sentence: a list of (word, tag) tuples
tree = chunker.parse(postoks)
Note: You could also use a Context Free Grammar (covered in Chapter 8).
Each chunked (or parsed) sentence (or in this example Noun Phrase, according to the grammar above) will be a subtree. To access these subtrees, we can use this function:
def leaves(tree):
    """Finds NP (noun phrase) leaf nodes of a chunk tree."""
    # NLTK 3.x uses .label(); older versions used .node
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield subtree.leaves()
Each of the yielded objects will be a list of word-tag pairs. From there you can find the verb.
Next, you could play with the grammar above or the parser. Verbs split noun phrases (see this diagram in Chapter 7), so you can probably just access the first NP after a VBD.
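A rough sketch of that "first NP after a VBD" idea, working on the tree produced by chunker.parse() above (it pairs each verb with the next NP chunk and will not handle every sentence shape):

def verb_object_pairs(tree):
    # Walk the top level of the chunk tree: plain tokens are (word, tag)
    # pairs, chunks (NPs here) are nltk.Tree objects.
    last_verb = None
    for child in tree:
        if isinstance(child, nltk.Tree) and child.label() == 'NP':
            if last_verb is not None:
                yield last_verb, child.leaves()
                last_verb = None
        elif isinstance(child, tuple) and child[1].startswith('VB'):
            last_verb = child[0]

# for verb, np_leaves in verb_object_pairs(tree):
#     print(verb, " ".join(word for word, tag in np_leaves))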
Sorry for the solution not being specific to your problem, but hopefully it is a helpful starting point. If you upload the file(s) I'll take another shot :)
What you are trying to do is find a dependency, namely dobj. I'm not yet familiar enough with SyntaxNet/Parsey to tell you exactly how to go about extracting that dependency from its output, but I believe this answer might help you. In short, you can configure Parsey to use CoNLL syntax for output, parse it into whatever structure you find easy to traverse, then look for the ROOT dependency to find the verb and *obj dependencies to find its objects.
If you have parsed the raw text into the CoNLL format using whatever parser, you can follow these steps to traverse the dependents of a node you are interested in:
build an adjacency matrix from the output CoNLL sentence
look for the node you are interested in (the verb in your case) and extract its dependents (indices) from the adjacency matrix
for each dependent, look up its dependency label in the 8th column of the CoNLL format.
PS: I can provide the code, but it would be better if you can code it yourself.
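That said, a rough sketch of those steps (using an adjacency list rather than a full matrix, and assuming standard 10-column CoNLL-X rows with HEAD in column 7 and DEPREL in column 8):

def read_sentence(conll_lines):
    """Return a list of token rows (each a list of column values)."""
    return [line.split("\t") for line in conll_lines if line.strip()]

def build_children(rows):
    """Adjacency list: head id -> list of dependent ids (1-based, 0 = root)."""
    children = {}
    for row in rows:
        children.setdefault(int(row[6]), []).append(int(row[0]))
    return children

def dependents_of(rows, children, head_id):
    """Yield (form, deprel) for each dependent of the given token id."""
    for dep_id in children.get(head_id, []):
        row = rows[dep_id - 1]
        yield row[1], row[7]

# rows = read_sentence(open("input.conll"))
# children = build_children(rows)
# root_verb_id = children[0][0]          # token attached to ROOT (head = 0)
# for form, deprel in dependents_of(rows, children, root_verb_id):
#     if deprel.endswith("obj"):         # dobj / iobj etc.
#         print(form)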

Preserving the order of elements when parsing Microdata in Python

I am facing the following problem:
When I am parsing an HTML document with Microdata markup in Python with rdflib, the ordering of elements is lost (which is a natural consequence of RDF not having an order for multiple elements).
E.g. the value method often returns the element that was the first value in the original markup, but not reliably.
Now, sometimes it will be very handy to preserve the original order. Is there a way to tell rdflib to return an ordered list for multiple values?
Or is there a simple Microdata-to-JSON (or JSON-LD) library for Python?
Thanks!
I actually found a very efficient way: instead of parsing the Microdata into RDF with rdflib, I used Ed Summers's microdata library at
https://github.com/edsu/microdata
This preserves the original order and is by far the simplest solution I found.
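A small usage sketch of that package (pip install microdata); the API is as I recall it from the project's README, so double-check against the repository:

import microdata
import urllib.request

url = "http://example.com/page-with-microdata.html"   # placeholder URL
items = microdata.get_items(urllib.request.urlopen(url))

for item in items:
    print(item.itemtype)            # e.g. schema.org item type URIs
    print(item.get("name"))         # first value for a property
    print(item.get_all("offers"))   # all values, in document order
    print(item.json())              # the full item serialized as JSON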

Parser generation

I am doing a project on software plagiarism detection. I intend to do it for the C language, and for that I am supposed to create a token generator and a parser, but I don't know where to start. Can anyone help me out with this?
I created a database of tokens and separated the tokens from my program. The next thing I want to do is compare two programs to find out whether one is plagiarized or not. For that I need to create a syntax analyzer, but I don't know where to start.
In other words, I want to create a parser for C programs in Python.
If you want to create a parser in Python you can look at these libraries:
PLY
pyparsing
and Lepl - new but very powerful
Building a real C parser by yourself is a really big task.
I suggest you either find one that is already done, e.g. pycparser, or define a really simple subset of C that is easily parsed.
You'll have plenty of work to do for your plagiarism detector after you are done parsing C.
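For instance, a minimal pycparser sketch (pip install pycparser; parse_file runs the C preprocessor first, so a cpp binary must be available, and "example.c" is a placeholder file name):

from pycparser import parse_file, c_ast

ast = parse_file("example.c", use_cpp=True)
ast.show()          # dump the AST for inspection

# Walking the AST, e.g. to collect function names for later comparison:
class FuncDefCollector(c_ast.NodeVisitor):
    def __init__(self):
        self.names = []
    def visit_FuncDef(self, node):
        self.names.append(node.decl.name)
        self.generic_visit(node)

collector = FuncDefCollector()
collector.visit(ast)
print(collector.names)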
I'm not sure you need to parse the token stream to detect the features you're looking for. In fact, it's probably going to complicate things more than anything.
What you're really looking for is sequences of original source code that have a very strong similarity with a suspect sample of code being tested. This sounds very similar to the purpose of a Bayes classifier, like those used in spam filtering and language detection.
