SyntaxNet creating tree to root verb

SyntaxNet creating tree to root verb - python

I am new to Python and the world of NLP. The recent announcement of Google's Syntaxnet intrigued me. However I am having a lot of trouble understanding documentation around both syntaxnet and related tools (nltk, etc.)
My goal: given an input such as "Wilbur kicked the ball" I would like to extract the root verb (kicked) and the object it pertains to "the ball".
I stumbled across "spacy.io" and this visualization seems to encapsulate what I am trying to accomplish: POS tag a string, and load it into some sort of tree structure so that I can start at the root verb and traverse the sentence.
I played around with the syntaxnet/demo.sh, and as suggested in this thread commented out the last couple lines to get conll output.
I then loaded this input in a python script (kludged together myself, probably not correct):
import nltk
from nltk.corpus import ConllCorpusReader
columntypes = ['ignore', 'words', 'ignore', 'ignore', 'pos']
corp = ConllCorpusReader('/Users/dgourlay/development/nlp','input.conll', columntypes)
I see that I have access to corp.tagged_words(), but no relationship between the words. Now I am stuck! How can I load this corpus into a tree type structure?
Any help is much appreciated!

This may have been better as a comment, but I don't yet have the required reputation.
I haven't used the ConllCorpusreader before (would you consider uploading the file you are loading to a gist and providing a link? It would be much easier to test), but I wrote a blog post which may help with the tree parsing aspect: here.
In particular, you probably want to chunk each sentence. Chapter 7 of the NLTK book has some more information on this, but this is the example from my blog:
# This grammar is described in the paper by S. N. Kim,
# T. Baldwin, and M.-Y. Kan.
# Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction.
# Technical report, University of Melbourne, Melbourne 2010.
grammar = r"""
NBAR:
# Nouns and Adjectives, terminated with Nouns
{<NN.*|JJ>*<NN.*>}
NP:
{<NBAR>}
# Above, connected with in/of/etc...
{<NBAR><IN><NBAR>}
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)
Note: You could also use a Context Free Grammar (covered in Chapter 8).
Each chunked (or parsed) sentence (or in this example Noun Phrase, according to the grammar above) will be a subtree. To access these subtrees, we can use this function:
def leaves(tree):
"""Finds NP (nounphrase) leaf nodes of a chunk tree."""
for subtree in tree.subtrees(filter = lambda t: t.node=='NP'):
yield subtree.leaves()
Each of the yielded objects will be a list of word-tag pairs. From there you can find the verb.
Next, you could play with the grammar above or the parser. Verbs split noun phrases (see this diagram in Chapter 7), so you can probably just access the first NP after a VBD.
Sorry for the solution not being specific to your problem, but hopefully it is a helpful starting point. If you upload the file(s) I'll take another shot :)

What you are trying to do is to find a dependency, namely dobj. I'm not yet familiar enough with SyntaxNet/Parsey to tell you how exactly to go extracting that dependency from it's output, but I believe this answer might help you. In short, you can configure Parsey to use ConLL syntax for output, parse it into whatever you find easy to traverse, then look for ROOT dependency to find the verb and *obj dependencies to find its objects.

If you have parsed the raw text in the conll format using whatever parser, you can follow the steps to traverse the dependents of a node that you are interested in:
build an adjacency matrix from the output conll sentence
look for the node you are interested in (verb in your case) and extract its dependents from the adjacency matrix (indices)
for each dependent look for its dependency label in the 8th column in the conll format.
PS: I can provide the code, but it would be better if you can code it yourself.

Related

How to save an AST generated by ANTLR

I have successfully generated an AST using ANTLR in python but I cannot figure out for the life of me how I can save this for later use. The only option I have been able to figure out is to use tree.toStringTree() method, but the output of this is messy and not overly convenient or easy to work with.
How do I save it and what format would be best/easiest to work with and be able to visualise and load it in in the future?
EDIT: I can see in the java documentation there is a DotGenerator() to generate a DOT file of the tree, but I can't find a way to do anything like this in python.

What you are looking for is a serializer/deserializer of the parse tree. Serialization was previously addressed in StackOverflow here. It isn't supported in the runtime (ASAIK) because it is not useful: one can reconstruct the tree very quickly by re-parsing the input. Even if you want to change the tree using a transformation, you can replace the nodes in the tree with sub-trees with node types that don't even exist in your parser, print out the tree, then re-parse to reconstruct the tree with the parse types for your grammar. It only makes sense if parsing with semantic analysis is very slow. So, you should consider the problem carefully.
However, it's not difficult to write a crude serializer/deserializer that does not consider "off-channel" content like spacing or comments. This C# program (which you could adapt to python) is an example that reconstructs the tree using the grammars-v4/sexpression.g4 grammar for a target grammar arithmetic.g4. Using toStringTree(rule-names), the tree is first serialized into a string. (Note, toStringTree() without the parser rule names is difficult to read, that is why I asked.) Then, the s-expression is parsed and a bottom-up reconstruction is performed using an Antlr visitor. Since toStringTree() does not mark the parse tree leaves with the type of the token (e.g., to distinguish between a number versus a symbol), the string is lexed to reconstruct the value. It also uses reflection to create the correct parse tree node type.
Outputting a Dot graph for the parse tree is also easy, which I included in the program, using a top-down recursive visitor. Here, the recursive function outputs each edge to a child for a particular node. Since each node name has to be unique (it's a tree), I added the pre-order tree number for the node to the name.
--Ken

generating Paraphrases of English text using PPDB

I need to generate paraphrase of an english sentence using the PPDB paraphrase database
I have downloaded the datasets from the website.

I would say your first step needs to be reducing the problem into more manageable components. Second figure out whether you want to paraphrase on a one-to-one, lexical, syntactical, phrase or combination basis. To inform this decision I would take one sentence and paraphrase it myself in order to get an idea of what I am looking for. Next I would start writing a parser for the downloaded data. Then I would remove the stopwords and incorporate a part-of-speech tagger like the ones included in spaCy or nltk for your example phrase.
Since they seem to give you all the information needed to make a successive dictionary filter that is where I would start. I would write a filter which found the parts of speech for each word in my sentence in the [LHS] column of the dataset and select a source that matches the word while minimizing/maximizing the value of 1 feature (like minimizing WordLenDiff) which in the case of "businessnow" <- "business now" = -1.5. Keeping track of the target feature you will then have a basic paraphrased sentence.
using this strategy your output could turn:
"the business uses 4 gb standard."
sent_score = 0
into:
"businessnow uses 4gb standard"
sent_score = -3
After you have a basic example the you can start exploring feature selection algorithms in like those in scikit-learn, etc. and incorporate word alignment. But I would seriously cut down on the scope of the problem and increase it gradually. In the end, how you approach the problem it depends on what the designated use is and how functional it needs to be.
Hope this helps.

Are there any Python NLP tools to figure out how many ways a sentence can be parsed?

I want to be able to measure ambiguity of a sentence, and my current my idea to do so is by measuring how many ways a sentence can be parsed. For example, the sentence "Fruit flies like a banana" can have to interpretations.
So far I have tried using the Stanford Parser, but it only interpreted each sentence in one way. My other idea was to measure how many different parts of speech each word in a sentence could mean, but each POS tagger I found only marked each word with 1 tag even when it could be multiple.
Are there are tools to do either?

From the Stanford Parser FAQ page, hope it helps:
Can I obtain multiple parse trees for a single input sentence?
Yes, for the PCFG parser (only). With a PCFG parser, you can give the option -printPCFGkBest n and it will print the n highest-scoring parses for a sentence. They can be printed either as phrase structure trees or as typed dependencies in the usual way via the -outputFormat option, and each receives a score (log probability). The k best parses are extracted efficiently using the algorithm of Huang and Chiang (2005).

How to calculate confidence score from dependency parse tree?

Is there any way to get confidence score or any score from dependency parse tree of a sentence using ntlk or something else?
Any advice and suggestions will be greatly appreciated!

It's a hard task, I am not aware of any tool doing it, but if you probably post something on the corpora mailing list, or language technology section of reddit you will get better replies. But if was a research question, I would suggest training a PCFG on a penntreebank dataset and then using it to compute the probabilities of parse trees assigned to sentences. You can grab Mark Johnson's implementation. Search for this line:
cky.tbz contains a very fast C implementation of a CKY PCFG parser,
together with programs for extracting PCFGs from treebanks, etc. This
was used in my 1999 CL article. (last updated 6th March, 2006)
CYK (viterbi) is a dynamic programming algorithm. PCFG stands for probabilistic CFG, which you typically train using penntreebank dataset. The summation over the probabilities of all possible parse trees for a sentence can be interpreted as how grammatically correct the sentence is. Sorry if this wasn't the actual answer, but this is a working answer and I can tell you more details if you decided to do it :).

Parsing Meaning from Text

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:
"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",
what's a light-weight/ easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything Title Capped is a proper noun).
To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.

You need to look at the Natural Language Toolkit, which is for exactly this sort of thing.
This section of the manual looks very relevant: Categorizing and Tagging Words - here's an extract:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective.

Use the NLTK, in particular chapter 7 on Information Extraction.
You say you want to extract meaning, and there are modules for semantic analysis, but I think IE is all you need--and honestly one of the only areas of NLP computers can handle right now.
See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramerez as a person, Dodgers as a sports organization, and Houston Astros as another sports organization, or whatever suits your domain) and Relationship Extraction. There is a NER chunker that you can plugin once you have the NLTK installed. From their examples, extracting a geo-political entity (GPE) and a person:
>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print nltk.ne_chunk(sent)
(S
The/DT
(GPE U.S./NNP)
is/VBZ
one/CD
...
according/VBG
to/TO
(PERSON Brooke/NNP T./NNP Mossman/NNP)
...)
Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.

Natural Language Processing (NLP) is the name for parsing, well, natural language. Many algorithms and heuristics exist, and it's an active field of research. Whatever algorithm you will code, it will need to be trained on a corpus. Just like a human: we learn a language by reading text written by other people (and/or by listening to sentences uttered by other people).
In practical terms, have a look at the Natural Language Toolkit. For a theoretical underpinning of whatever you are going to code, you may want to check out Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze.
(source: stanford.edu)

Here is the book I stumbled upon recently: Natural Language Processing with Python

What you want is called NP (noun phrase) chunking, or extraction.
Some links here
As pointed out, this is very problem domain specific stuff. The more you can narrow it down, the more effective it will be. And you're going to have to train your program on your specific domain.

This is a really really complicated topic. Generally, this sort of stuff falls under the rubric of Natural Language Processing, and tends to be tricky at best. The difficulty of this sort of stuff is precisely why there still is no completely automated system for handling customer service and the like.
Generally, the approach to this stuff REALLY depends on precisely what your problem domain is. If you're able to winnow down the problem domain, you can gain some very serious benefits; to use your example, if you're able to determine that your problem domain is baseball, then that gives you a really strong head start. Even then, it's a LOT of work to get anything particularly useful going.
For what it's worth, yes, an existing corpus of words is going to be useful. More importantly, determining the functional complexity expected of the system is going to be critical; do you need to parse simple sentences, or is there a need for parsing complex behavior? Can you constrain the inputs to a relatively simple set?

Regular expressions can help in some scenario. Here is a detailed example: What’s the Most Mentioned Scanner on CNET Forum, which used a regular expression to find all mentioned scanners in CNET forum posts.
In the post, a regular expression as such was used:
(?i)((?:\w+\s\w+\s(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)\s(\w+\s){0,1}(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner))|(?:(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner)\s(\w+\s){1,2}(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)))
in order to match either of the following:
two words, then model number (including all-in-one), then “scanner”
“scanner”, then one or two words, then model number (including
all-in-one)
As a result, the text extracted from the post was like,
discontinued HP C9900A photo scanner
scanning his old x-rays
new Epson V700 scanner
HP ScanJet 4850 scanner
Epson Perfection 3170 scanner
This regular expression solution worked in a way.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.