Going through the NLTK book, it's not clear how to generate a dependency tree from a given sentence.
The relevant section of the book (the sub-chapter on dependency grammar) gives an example figure, but it doesn't show how to parse a sentence to come up with those relationships. Or am I missing something fundamental in NLP?
EDIT:
I want something similar to what the Stanford parser does:
Given a sentence "I shot an elephant in my sleep", it should return something like:
nsubj(shot-2, I-1)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)
We can use Stanford Parser from NLTK.
Requirements
You need to download two things from their website:
The Stanford CoreNLP parser.
Language model for your desired language (e.g. the English language model)
Warning!
Make sure that your language model version matches your Stanford CoreNLP parser version!
The current CoreNLP version as of May 22, 2018 is 3.9.1.
After downloading the two files, extract the zip file anywhere you like.
Python Code
Next, load the model and use it through NLTK
from nltk.parse.stanford import StanfordDependencyParser
path_to_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser.jar'
path_to_models_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'
dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)
result = dependency_parser.raw_parse('I shot an elephant in my sleep')
dep = next(result)  # result is an iterator
list(dep.triples())
Output
The output of the last line is:
[((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')),
((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')),
((u'elephant', u'NN'), u'det', (u'an', u'DT')),
((u'shot', u'VBD'), u'prep', (u'in', u'IN')),
((u'in', u'IN'), u'pobj', (u'sleep', u'NN')),
((u'sleep', u'NN'), u'poss', (u'my', u'PRP$'))]
I think this is what you want.
I think you could use a corpus-based dependency parser instead of the grammar-based one NLTK provides.
Doing corpus-based dependency parsing on even a small amount of text in Python is not ideal performance-wise, so NLTK provides a wrapper for MaltParser, a corpus-based dependency parser.
You might find this other question about RDF representation of sentences relevant.
If you need better performance, then spacy (https://spacy.io/) is the best choice. Usage is very simple:
import spacy
nlp = spacy.load('en')
sents = nlp(u'A woman is walking through the door.')
You'll get a dependency tree as output, and you can easily dig out all the information you need. You can also define your own custom pipelines. See more on their website.
https://spacy.io/docs/usage/
If you want to be serious about dependency parsing, don't use NLTK: its algorithms are dated and slow. Try something like this: https://spacy.io/
To use Stanford Parser from NLTK
1) Run CoreNLP Server at localhost
Download Stanford CoreNLP here (and also model file for your language).
The server can be started by running the following command (more details here)
# Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
or via the Python client API (you need to configure the CORENLP_HOME environment variable first)
import os
import corenlp

os.environ["CORENLP_HOME"] = "dir"
client = corenlp.CoreNLPClient()
# do something
client.stop()
2) Call the dependency parser from NLTK
>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parse, = dep_parser.raw_parse(
... 'The quick brown fox jumps over the lazy dog.'
... )
>>> print(parse.to_conll(4))
The DT 4 det
quick JJ 4 amod
brown JJ 4 amod
fox NN 5 nsubj
jumps VBZ 0 ROOT
over IN 9 case
the DT 9 det
lazy JJ 9 amod
dog NN 5 nmod
. . 5 punct
See detailed documentation here, and also this question: NLTK CoreNLPDependencyParser: Failed to establish connection.
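The 4-column CoNLL output above can be converted into the rel(head-index, dependent-index) notation the question asked for with a few lines of plain Python. The parse is hard-coded below so the snippet runs without a CoreNLP server:

```python
# Columns in parse.to_conll(4): word, tag, head index, relation.
conll = """The\tDT\t4\tdet
quick\tJJ\t4\tamod
brown\tJJ\t4\tamod
fox\tNN\t5\tnsubj
jumps\tVBZ\t0\tROOT
over\tIN\t9\tcase
the\tDT\t9\tdet
lazy\tJJ\t9\tamod
dog\tNN\t5\tnmod
.\t.\t5\tpunct"""

rows = [line.split('\t') for line in conll.splitlines()]
words = [row[0] for row in rows]

relations = []
for i, (word, tag, head, rel) in enumerate(rows, start=1):
    head = int(head)
    head_word = 'ROOT' if head == 0 else words[head - 1]
    relations.append('{}({}-{}, {}-{})'.format(rel, head_word, head, word, i))

print('\n'.join(relations))  # first line: det(fox-4, The-1)
```

The same loop works on any to_conll(4) string, so you can feed it the parser's real output instead of the hard-coded sample.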
From the Stanford Parser documentation: "the dependencies can be obtained using our software [...] on phrase-structure trees using the EnglishGrammaticalStructure class available in the parser package." http://nlp.stanford.edu/software/stanford-dependencies.shtml
The dependencies manual also mentions: "Or our conversion tool can convert the output of other constituency parsers to the Stanford Dependencies representation." http://nlp.stanford.edu/software/dependencies_manual.pdf
Neither functionality seems to be implemented in NLTK currently.
A little late to the party, but I wanted to add some example code with SpaCy that gets you your desired output:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I shot an elephant in my sleep")
for token in doc:
    print(f"{token.dep_}({token.head.text}-{token.head.i + 1}, {token.text}-{token.i + 1})")
And here's the output, very similar to your desired output:
nsubj(shot-2, I-1)
ROOT(shot-2, shot-2)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)
Hope that helps!
Related
I am working with Stanford NLP for one of my Python projects. I want to fetch the word, lemma, xpos, governor and dependencies from it. But the output produced by the API is a string in this format:
<Token index=4;words=[<Word index=4;text=born;lemma=bear;upos=VERB;xpos=VBN;feats=Tense=Past|VerbForm=Part|Voice=Pass;governor=0;dependency_relation=root>]>
<Token index=5;words=[<Word index=5;text=in;lemma=in;upos=ADP;xpos=IN;feats=_;governor=6;dependency_relation=case>]>
<Token index=6;words=[<Word index=6;text=Hawaii;lemma=Hawaii;upos=PROPN;xpos=NNP;feats=Number=Sing;governor=4;dependency_relation=obl>]>
<Token index=7;words=[<Word index=7;text=.;lemma=.;upos=PUNCT;xpos=.;feats=_;governor=4;dependency_relation=punct>]>
I want to know how to parse the result to get it into an easy and accessible format. Or can I convert it into tree form? Or is there any other library available that gives me the lemma, POS tag and dependencies like this?
After a bit of research I found that there is no need to convert this output into a tree structure; you can access the fields directly through stanfordnlp or the newer stanza library like this:
import stanza

nlp = stanza.Pipeline('en')  # This sets up a default neural pipeline in English
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
doc.sentences[0].print_dependencies()
for sent in doc.sentences:
    for word in sent.words:
        print(word.id)
        print(word.text)
        print(word.lemma)
        print(word.xpos)
        print(word.upos)
You can easily process the result this way; instead of printing the values, you can add your own logic.
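As for the "tree form" part of the question: once you have (id, text, head) triples, whether from stanza's word.id/word.head or anywhere else, building and printing a tree takes only a few lines of plain Python. The triples below are hard-coded from the sample output so the sketch runs standalone:

```python
from collections import defaultdict

# (id, text, head) triples, hard-coded from the sample output in the
# question; head == 0 marks the root.
words = [(4, 'born', 0), (5, 'in', 6), (6, 'Hawaii', 4), (7, '.', 4)]

# Map each head id to the list of its dependents.
children = defaultdict(list)
for wid, text, head in words:
    children[head].append((wid, text))

def print_tree(head_id=0, depth=0):
    # Depth-first walk, indenting dependents under their governor.
    for wid, text in children[head_id]:
        print('  ' * depth + text)
        print_tree(wid, depth + 1)

print_tree()
```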
Also, you can check out Spacy for that, as per @ygorg's comment. It has similar features to Stanford NLP, including dependencies.
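And if you are stuck with the raw string representation shown in the question, it is regular enough to pick apart with the stdlib re module. A minimal sketch (the field names come from the sample output, not from any official API):

```python
import re

raw = ("<Token index=6;words=[<Word index=6;text=Hawaii;lemma=Hawaii;"
       "upos=PROPN;xpos=NNP;feats=Number=Sing;governor=4;"
       "dependency_relation=obl>]>")

# Grab the key=value pairs inside the <Word ...> part.
word_part = re.search(r'<Word ([^>]*)>', raw).group(1)
# split('=', 1) keeps values like "Number=Sing" intact.
fields = dict(pair.split('=', 1) for pair in word_part.split(';'))

print(fields['text'], fields['lemma'], fields['xpos'],
      fields['governor'], fields['dependency_relation'])
```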
I have been going through many libraries like Whoosh/NLTK and concepts like WordNet.
However, I am unable to tackle my problem, and I am not sure whether I can find a library for this or whether I have to build one using the above-mentioned resources.
Question:
My scenario is that I have to search for keywords.
Say I have key words like 'Sales Document' / 'Purchase Documents' and have to search for them in a small 10-15 pages book.
The catch is:
They can also be written as 'Sales should be documented' or 'company selling should be written in the text files' (for the 'Sales Document' keyword). Is there an existing approach for this, or will I have to build something myself?
The code for the POS Tags is as follows. If no library is available I will have to proceed with this.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series
import nltk
from nltk.corpus import wordnet

def tag(x):
    return pos_tag(word_tokenize(x))

synonyms = []
antonyms = []

for syn in wordnet.synsets("Sales document"):
    #print("Down2")
    print(syn)
    #print("Down")
    for l in syn.lemmas():
        print(" \n")
        print(l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

for i in synonyms:
    print(tag(i))
Update:
We went ahead and made a Python program, Dhund - feel free to fork it (pun intended).
The Git repo is very untidy right now; we will clean it up once it is completed. It is currently still in development.
Here is the link.
Occurrences like "Sales should be documented" can be matched by increasing the slop parameter of Whoosh's Phrase query object.
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
You can also define slop in Query like this: "Sales should be documented"~5
Matching the second example, "company selling should be written in the text files", requires semantic processing of your texts. Whoosh has a low-level WordNet thesaurus implementation that lets you index synonyms, but it only handles one-word synonyms.
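To see what slop buys you without setting up a Whoosh index, here is a pure-Python illustration of the same idea: the keyword stems count as a phrase match if they occur within slop words of each other. The crude suffix-stripper is a demo-only stand-in for a real stemmer:

```python
def crude_stem(word):
    # Demo-only stand-in for a real stemmer (use Whoosh's or NLTK's in practice).
    word = word.lower().strip('.,')
    for suffix in ('mented', 'ments', 'ment', 'ed', 's'):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def phrase_match(keywords, text, slop=1):
    key_stems = [crude_stem(w) for w in keywords.split()]
    text_stems = [crude_stem(w) for w in text.split()]
    try:
        # First occurrence of each keyword stem; good enough for a demo.
        positions = [text_stems.index(stem) for stem in key_stems]
    except ValueError:  # a keyword stem is missing entirely
        return False
    # slop = how many extra words may sit between consecutive keywords.
    return all(b - a - 1 <= slop for a, b in zip(positions, positions[1:]))

print(phrase_match('Sales Document', 'Sales should be documented', slop=3))  # True
print(phrase_match('Sales Document', 'Sales should be documented', slop=1))  # False
```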
I recently got into NLP and tried using NLTK and TextBlob for analyzing texts. I would like to develop an app that analyzes reviews made by travelers, so I have to handle a lot of text written in different languages. I need to do two main operations: POS tagging and lemmatization. I have seen that in NLTK it is possible to choose the right language for sentence tokenization, like this:
tokenizer = nltk.data.load('tokenizers/punkt/PY3/italian.pickle')
I haven't found the right way to set the language for POS tagging and lemmatization in different languages yet. How can I set the correct corpora/dictionaries for non-English texts such as Italian, French, Spanish or German? I also see that it is possible to import the "TreeBank" or "WordNet" modules, but I don't understand how to use them. Otherwise, where can I find the respective corpora?
Can you give me some suggestions or references? Please bear in mind that I'm not an NLTK expert.
Many thanks.
If you are looking for another multilingual POS tagger, you might want to try RDRPOSTagger: a robust, easy-to-use and language-independent toolkit for POS and morphological tagging. See experimental results including performance speed and tagging accuracy on 13 languages in this paper. RDRPOSTagger now supports pre-trained POS and morphological tagging models for Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese. RDRPOSTagger also supports the pre-trained Universal POS tagging models for 40 languages.
In Python, you can utilize the pre-trained models for tagging a raw unlabeled text corpus as:
python RDRPOSTagger.py tag PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS
Example: python RDRPOSTagger.py tag ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest
If you would like to program with RDRPOSTagger, please follow code lines 92-98 in the RDRPOSTagger.py module in the pSCRDRTagger package. Here is an example:
r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/German.RDR") #Load POS tagging model for German
DICT = readDictionary("../Models/POS/German.DICT") #Load a German lexicon
r.tagRawSentence(DICT, "Die Reaktion des deutschen Außenministers zeige , daß dieser die außerordentlich wichtige Rolle Irans in der islamischen Welt erkenne .")
r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/French.RDR") # Load POS tagging model for French
DICT = readDictionary("../Models/POS/French.DICT") # Load a French lexicon
r.tagRawSentence(DICT, "Cette annonce a fait l' effet d' une véritable bombe . ")
There is no option that you can pass to NLTK's POS-tagging and lemmatizing functions that will make them process other languages.
One solution would be to get a training corpus for each language and train your own POS-taggers with NLTK, then figure out a lemmatization solution, maybe dictionary-based, for each language.
That might be overkill though, as there is already a one-stop solution for both tasks in Italian, French, Spanish and German (and many other languages): TreeTagger. It is not as state-of-the-art as the best English POS-taggers and lemmatizers, but it still does a good job.
What you want is to install TreeTagger on your system and be able to call it from Python. Here is a GitHub repo by miotto that lets you do just that.
The following snippet shows you how to test that you set up everything correctly. As you can see, I am able to POS-tag and lemmatize in one function call, and I can do it just as easily in English and in French.
>>> import os
>>> os.environ['TREETAGGER'] = "/opt/treetagger/cmd" # Or wherever you installed TreeTagger
>>> from treetagger import TreeTagger
>>> tt_en = TreeTagger(encoding='utf-8', language='english')
>>> tt_en.tag('Does this thing even work?')
[[u'Does', u'VBZ', u'do'], [u'this', u'DT', u'this'], [u'thing', u'NN', u'thing'], [u'even', u'RB', u'even'], [u'work', u'VB', u'work'], [u'?', u'SENT', u'?']]
>>> tt_fr = TreeTagger(encoding='utf-8', language='french')
>>> tt_fr.tag(u'Mon Dieu, faites que ça marche!')
[[u'Mon', u'DET:POS', u'mon'], [u'Dieu', u'NOM', u'Dieu'], [u',', u'PUN', u','], [u'faites', u'VER:pres', u'faire'], [u'que', u'KON', u'que'], [u'\xe7a', u'PRO:DEM', u'cela'], [u'marche', u'NOM', u'marche'], [u'!', u'SENT', u'!']]
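The nested lists TreeTagger returns are easy to post-process. For example, pulling out just the lemmas (hard-coding a copy of the English output above so the snippet runs without TreeTagger installed):

```python
# TreeTagger returns [word, tag, lemma] triples per token.
tagged = [['Does', 'VBZ', 'do'], ['this', 'DT', 'this'],
          ['thing', 'NN', 'thing'], ['even', 'RB', 'even'],
          ['work', 'VB', 'work'], ['?', 'SENT', '?']]

lemmas = [lemma for word, tag, lemma in tagged]
print(' '.join(lemmas))  # -> do this thing even work ?
```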
Since this question gets asked a lot (and since the installation process is not super straight-forward, IMO), I will write a blog post on the matter and update this answer with a link to it as soon as it is done.
EDIT:
Here is the above-mentioned blog post.
I quite like using SpaCy for multilingual NLP. They have trained models for Catalan, Chinese, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Lithuanian, Macedonian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian and Spanish.
You would simply load a different model depending on the language you're working with:
import spacy
nlp_DE = spacy.load("de_core_news_sm")
nlp_FR = spacy.load("fr_core_news_sm")
It's not as accurate as TreeTagger or HanoverTagger, but it is very easy to use, and its output is much better than NLTK's.
The goal of my project is to answer queries such as, for example:
"I am looking for American women between 20 and 30 years old who work in Google"
I then have to process the query and to look into a DB to find the answer.
For this, I would need to combine the Stanford 3-class NERTagger and my own tagger. Indeed, my NER tagger can tag ages, nationalities and gender. But I need the Stanford tagger to tag organizations as I don't have any training file for this.
Right now, I have code like this:
def __init__(self, q):
    self.userQuery = q

def get_tagged_tokens(self):
    st = NERTagger(r'C:\stanford-ner-2015-01-30\my-ner-model.ser.gz',
                   r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    result = st.tag(self.userQuery.split())[0]
    return result
And I would like to have something like this:
def get_tagged_tokens(self):
    st = NERTagger(r'C:\stanford-ner-2015-01-30\my-ner-model.ser.gz',
                   r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    st_def = NERTagger(r'C:\stanford-ner-2015-01-30\classifiers\english.all.3class.distsim.crf.ser.gz',
                       r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    tagger = BackoffTagger([st, st_def])
    result = tagger.tag(self.userQuery.split())[0]
    return result
This would mean that the tagger first uses my model and then the Stanford one to tag the words left untagged.
Is it possible to combine my model with the Stanford model just to tag organizations? If yes, what is the best way to perform this?
Thank you!
The new NERClassifierCombiner with Stanford CoreNLP 3.5.2 or the new Stanford NER 3.5.2 has added command line functionality that makes it easy to get this effect with NLTK.
When you provide a list of serialized classifiers, NERClassifierCombiner will run them in sequence. After one tagger tags the sentence, no other taggers will tag tokens that have already been tagged. So note in my demo code I provide 2 classifiers as an example. They are run in the order you place them. I believe you can put as many as 10 in there if I recall correctly!
First, make sure that you have the latest copy of Stanford CoreNLP 3.5.2 or Stanford NER 3.5.2 , so that you have the right .jar file with this new functionality.
Second, make sure your custom NER model was built with Stanford CoreNLP or Stanford NER; this won't work otherwise! It should be OK if you used older versions.
Third, I have provided some sample code that should work, the main gist of this is to subclass NERTagger:
If people would like I could look into pushing this to NLTK so it is in there by default!
Here is some sample code (it is a little hacky since I was just rushing this out the door, for instance in NERComboTagger's constructor there is no point to the first argument being classifier_path1, but the code would crash if I didn't put a valid file there):
#!/usr/bin/python
from nltk.tag.stanford import NERTagger

class NERComboTagger(NERTagger):

    def __init__(self, *args, **kwargs):
        self.stanford_ner_models = kwargs['stanford_ner_models']
        kwargs.pop("stanford_ner_models")
        super(NERComboTagger, self).__init__(*args, **kwargs)

    @property
    def _cmd(self):
        return ['edu.stanford.nlp.ie.NERClassifierCombiner',
                '-ner.model',
                self.stanford_ner_models,
                '-textFile',
                self._input_file_path,
                '-outputFormat',
                self._FORMAT,
                '-tokenizerFactory',
                'edu.stanford.nlp.process.WhitespaceTokenizer',
                '-tokenizerOptions',
                '"tokenizeNLs=false"']

classifier_path1 = "classifiers/english.conll.4class.distsim.crf.ser.gz"
classifier_path2 = "classifiers/english.muc.7class.distsim.crf.ser.gz"
ner_jar_path = "stanford-ner.jar"
st = NERComboTagger(classifier_path1, ner_jar_path,
                    stanford_ner_models=classifier_path1 + "," + classifier_path2)
print(st.tag("Barack Obama is from Hawaii .".split(" ")))
Note that the major change in the subclass is what is returned by _cmd.
Also note that I ran this in the unzipped folder stanford-ner-2015-04-20 , so the paths are relative to that.
I get this output:
[('Barack','PERSON'), ('Obama', 'PERSON'), ('is','O'), ('from', 'O'), ('Hawaii', 'LOCATION'), ('.', 'O')]
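A small pure-Python helper (no Stanford jars needed to run it) can collapse that flat (token, tag) list into whole entities with itertools.groupby:

```python
from itertools import groupby

# Tagged output, hard-coded from the sample above.
tagged = [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'),
          ('from', 'O'), ('Hawaii', 'LOCATION'), ('.', 'O')]

# Group runs of identical tags and join the tokens in each non-O run.
entities = []
for tag, group in groupby(tagged, key=lambda pair: pair[1]):
    if tag != 'O':
        entities.append((' '.join(tok for tok, _ in group), tag))

print(entities)  # -> [('Barack Obama', 'PERSON'), ('Hawaii', 'LOCATION')]
```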
Here is a link to the Stanford NER page:
http://nlp.stanford.edu/software/CRF-NER.shtml
Please let me know if you need any more help or if there are any errors in my code, I may have made a mistake while transcribing, but it works on my laptop!
I am using NLTK to extract nouns from a text-string starting with the following command:
tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string)))
It works fine in English. Is there an easy way to make it work for German as well?
(I have no experience with natural language processing, but I managed to use the Python NLTK library, which is great so far.)
Natural language software does its magic by leveraging corpora and the statistics they provide. You'll need to tell nltk about some German corpus to help it tokenize German correctly. I believe the EUROPARL corpus might help get you going.
See nltk.corpus.europarl_raw and this answer for example configuration.
Also, consider tagging this question with "nlp".
The Pattern library includes a function for parsing German sentences and the result includes the part-of-speech tags. The following is copied from their documentation:
from pattern.de import parse, split
s = parse('Die Katze liegt auf der Matte.')
s = split(s)
print(s.sentences[0])
>>> Sentence('Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O'
'auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O')
Update: Another option is spacy, there is a quick example in this blog article:
import spacy
nlp = spacy.load('de')
doc = nlp(u'Ich bin ein Berliner.')
# show universal pos tags
print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.pos_) for t in doc))
# output: Ich/PRON bin/AUX ein/DET Berliner/NOUN ./PUNCT
Part-of-speech (POS) tagging is very specific to a particular [natural] language. NLTK includes many different taggers, which use distinct techniques to infer the tag of a given token in a given sentence. Most (but not all) of these taggers use a statistical model of some sort as the main or sole device to "do the trick". Such taggers require "training data" upon which to build this statistical representation of the language, and the training data comes in the form of corpora.
The NLTK "distribution" itself includes many of these corpora, as well as a set of "corpora readers" which provide an API to read different types of corpora. I don't know the current state of affairs in NLTK proper, and whether it includes any German corpus. You can however locate some free corpora, which you'll then need to convert to a format that satisfies the proper NLTK corpora reader, and then you can use them to train a POS tagger for the German language.
You can even create your own corpus, but that is a hell of a painstaking job; if you work in a university, you'd better find ways of bribing and otherwise coercing students to do it for you ;-)
Possibly you can use the Stanford POS tagger. Below is a recipe I wrote. There are also Python recipes for German NLP that I've compiled; you can access them at http://htmlpreview.github.io/?https://github.com/alvations/DLTK/blob/master/docs/index.html
#-*- coding: utf8 -*-
import os, glob, codecs

def installStanfordTag():
    if not os.path.exists('stanford-postagger-full-2013-06-20'):
        os.system('wget http://nlp.stanford.edu/software/stanford-postagger-full-2013-06-20.zip')
        os.system('unzip stanford-postagger-full-2013-06-20.zip')

def tag(infile):
    cmd = "./stanford-postagger.sh " + models[m] + " " + infile
    tagout = os.popen(cmd).readlines()
    return [i.strip() for i in tagout]

def taglinebyline(sents):
    tagged = []
    for ss in sents:
        os.popen("echo '''" + ss + "''' > stanfordtemp.txt")
        tagged.append(tag('stanfordtemp.txt')[0])
    return tagged

installStanfordTag()
stagdir = './stanford-postagger-full-2013-06-20/'
models = {'fast': 'models/german-fast.tagger',
          'dewac': 'models/german-dewac.tagger',
          'hgc': 'models/german-hgc.tagger'}
os.chdir(stagdir)
print(os.getcwd())

m = 'fast'  # It's best to use the fast German tagger if your data is small.
sentences = ['Ich bin schwanger .', 'Ich bin wieder schwanger .', 'Ich verstehe nur Bahnhof .']
tagged_sents = taglinebyline(sentences)  # Call the Stanford tagger

for sent in tagged_sents:
    print(sent)
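If you then want (word, tag) pairs instead of raw strings, the Stanford tagger's plain-text output joins each word to its tag with a separator (an underscore by default), and a small helper can split that back apart. The sample line below is illustrative (STTS-style tags), not actual tagger output:

```python
def parse_tagged_line(line, sep='_'):
    # Split "word_TAG word_TAG ..." into (word, tag) pairs.
    # rsplit guards against words that themselves contain the separator.
    return [tuple(token.rsplit(sep, 1)) for token in line.split()]

line = 'Ich_PPER bin_VAFIN schwanger_ADJD ._$.'
print(parse_tagged_line(line))
```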
I have written a blog-post about how to convert the German annotated TIGER Corpus in order to use it with the NLTK. Have a look at it here.
It seems a little late to answer the question, but it might be helpful for anyone who finds this question by googling like I did, so I'd like to share the things I found out.
The HanoverTagger might be a useful tool for this task.
You can find tutorials here and here (the second one is in German).
The tagger seems to use the STTS tagset, in case you need a complete list of all tags.