Convert String Token into Tree in Python (Stanford NLP)

I am working with Stanford NLP for one of my Python projects. I want to fetch the word, lemma, xpos, governor and dependency relation from it, but the output produced by the API is a string that looks like this:
<Token index=4;words=[<Word index=4;text=born;lemma=bear;upos=VERB;xpos=VBN;feats=Tense=Past|VerbForm=Part|Voice=Pass;governor=0;dependency_relation=root>]>
<Token index=5;words=[<Word index=5;text=in;lemma=in;upos=ADP;xpos=IN;feats=_;governor=6;dependency_relation=case>]>
<Token index=6;words=[<Word index=6;text=Hawaii;lemma=Hawaii;upos=PROPN;xpos=NNP;feats=Number=Sing;governor=4;dependency_relation=obl>]>
<Token index=7;words=[<Word index=7;text=.;lemma=.;upos=PUNCT;xpos=.;feats=_;governor=4;dependency_relation=punct>]>
I want to know how to parse this result into an easily accessible format. Can I convert it into a tree? Or is there another library that gives me the lemma, POS tag and dependencies like this?

After a bit of research I found that there is no need to convert this output into a tree structure; you can access the parsed values directly with the stanfordnlp package or its successor, the stanza library, like this:
import stanza

nlp = stanza.Pipeline('en')  # This sets up a default neural pipeline in English
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
doc.sentences[0].print_dependencies()

for sent in doc.sentences:
    for word in sent.words:
        print(word.id)
        print(word.text)
        print(word.lemma)
        print(word.xpos)
        print(word.upos)
You can easily process the result this way; instead of printing the values, you can add your own logic. One way to collect them into a structure is sketched below.
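For example, a minimal sketch (the helper name word_rows and the chosen fields are just an illustration) that collects each word into a dictionary; in stanza the governor index is word.head and the relation label is word.deprel:

import stanza

nlp = stanza.Pipeline('en')
doc = nlp("Barack Obama was born in Hawaii.")

def word_rows(doc):
    # Collect one dict per word instead of printing the values.
    rows = []
    for sent in doc.sentences:
        for word in sent.words:
            rows.append({
                "id": word.id,
                "text": word.text,
                "lemma": word.lemma,
                "upos": word.upos,
                "xpos": word.xpos,
                "governor": word.head,    # 0 means the word is the root
                "deprel": word.deprel,
            })
    return rows

for row in word_rows(doc):
    print(row)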
Also, you can check out spaCy for this, as per @ygorg's comment. It offers similar features to Stanford NLP, including dependencies.
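For comparison, a minimal spaCy sketch (assuming the en_core_web_sm model is installed) that prints the lemma, POS tags, dependency label and head for each token:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")

for token in doc:
    # lemma, universal POS, fine-grained tag, dependency relation and head word
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.text)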

Related

spacy lemmatizing inconsistency with lemma_lookup table

There seems to be an inconsistency when iterating over a spacy document and lemmatizing the tokens compared to looking up the lemma of the word in the Vocab lemma_lookup table.
nlp = spacy.load("en_core_web_lg")
doc = nlp("I'm running faster")
for tok in doc:
print(tok.lemma_)
This prints out "faster" as lemma for the token "faster" instead of "fast". However the token does exist in the lemma_lookup table.
nlp.vocab.lookups.get_table("lemma_lookup")["faster"]
which outputs "fast"
Am I doing something wrong? Or is there another reason why these two are different? Maybe my definitions are not correct and I'm comparing apples with oranges?
I'm using the following versions on Ubuntu Linux:
spacy==2.2.4
spacy-lookups-data==0.1.0
With a model like en_core_web_lg that includes a tagger and rules for a rule-based lemmatizer, it provides the rule-based lemmas rather than the lookup lemmas when POS tags are available to use with the rules. The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.
With faster, the POS tag is ADV, which is left as-is by the rules. If it had been tagged as ADJ, the lemma would be fast with the current rules.
The lemmatizer tries to provide the best lemmas it can without requiring the user to manage any settings, but it's also not very configurable right now (v2.2). If you want to run the tagger but have lookup lemmas, you'll have to replace the lemmas after running the tagger.
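For illustration, a minimal sketch of that replacement step, assuming spaCy v2.2 as in the question (token.lemma_ is writable, and the lookup table accepts string keys as shown above):

import spacy

nlp = spacy.load("en_core_web_lg")
lookup = nlp.vocab.lookups.get_table("lemma_lookup")

doc = nlp("I'm running faster")
for token in doc:
    try:
        # overwrite the rule-based lemma with the lookup lemma, if one exists
        token.lemma_ = lookup[token.lower_]
    except KeyError:
        pass  # keep the rule-based lemma for words not in the table

print([token.lemma_ for token in doc])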
aab wrote that:
The lookup lemmas aren't great overall and are only used as a backup
if the model/pipeline doesn't have enough information to provide the
rule-based lemmas.
This is also how I understood it from the spaCy code, but since I wanted to add my own dictionaries to improve the lemmatization of the pretrained models, I decided to try out the following, which worked:
import spacy

# load model
nlp = spacy.load('es_core_news_lg')

# define dictionary, where key = lemma, value = list of tokens to be lemmatized (not case-sensitive)
corr_es = {
    "decir": ["dixo", "decia", "Dixo", "Decia"],
    "ir": ["iba", "Iba"],
    "parecer": ["parecia", "Parecia"],
    "poder": ["podia", "Podia"],
    "ser": ["fuesse", "Fuesse"],
    "haber": ["habia", "havia", "Habia", "Havia"],
    "ahora": ["aora", "Aora"],
    "estar": ["estàn", "Estàn"],
    "lujo": ["luxo", "luxar", "Luxo", "Luxar"],
    "razón": ["razon", "razòn", "Razon", "Razòn"],
    "caballero": ["cavallero", "Cavallero"],
    "mujer": ["muger", "mugeres", "Muger", "Mugeres"],
    "vez": ["vèz", "Vèz"],
    "jamás": ["jamas", "Jamas"],
    "demás": ["demas", "demàs", "Demas", "Demàs"],
    "cuidar": ["cuydado", "Cuydado"],
    "posible": ["possible", "Possible"],
    "comedia": ["comediar", "Comedias"],
    "poeta": ["poetas", "Poetas"],
    "mano": ["manir", "Manir"],
    "barba": ["barbar", "Barbar"],
    "idea": ["ideo", "Ideo"]
}

# map each token to its lemma (the key) in the lookup table
for key, value in corr_es.items():
    for token in value:
        nlp.vocab.lookups.get_table("lemma_lookup")[token] = key

# process the text
doc = nlp(text)  # 'text' is the string you want to process
Hopefully this helps.

Is there a way to identify cities in a text without maintaining a prior vocabulary, in Python?

I have to identify cities in a document (it contains only text). I do not want to maintain an entire vocabulary, as that is not a practical solution. I also do not have an Azure Text Analytics API account.
I have already tried spaCy: I ran NER, identified geolocations, and passed that output to spellchecker() to train the model. But the issue is that NER requires sentences, while my input consists of individual words.
I am relatively new to this field.
You can check out the geotext library.
Working example with a sentence:
text = "The capital of Belarus is Minsk. Minsk is not so far away from Kiev or Moscow. Russians and Belarussians are nice people."
from geotext import GeoText
places = GeoText(text)
print(places.cities)
Output:
['Minsk', 'Minsk', 'Kiev', 'Moscow']
Working example with list of words:
wordList = ['London', 'cricket', 'biryani', 'Vilnius', 'Delhi']
for word in wordList:
    places = GeoText(word)
    if places.cities:
        print(places.cities)
Output:
['London']
['Vilnius']
['Delhi']
geograpy is another alternative. However, I find geotext lighter, since it has fewer external dependencies.
There is a list of libraries that may help you, but in my experience there is no perfect library for this. If you know all the cities that may appear in the text, then a vocabulary is the best option; a small sketch follows below.
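For illustration, a minimal gazetteer-style sketch with spaCy's PhraseMatcher; the city list here is just a stand-in for whatever vocabulary you maintain:

import spacy
from spacy.matcher import PhraseMatcher

CITIES = ["London", "Vilnius", "Delhi", "New York"]  # stand-in for your known city list

nlp = spacy.blank("en")  # a blank pipeline is enough for phrase matching
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
patterns = [nlp.make_doc(city) for city in CITIES]
matcher.add("CITY", patterns)  # spaCy 3.x; in spaCy 2.x use matcher.add("CITY", None, *patterns)

doc = nlp("I flew from london to Delhi via New York.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)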

How to get the base form of an adj or adverb using lemma in spacy

For a project, I would like to be able to get the noun form of an adjective or adverb if there is one using NLP.
For example, "deathly" would return "death" and "dead" would return "death".
"lively" would return "life".
I've tried using the spaCy lemmatizer, but it does not manage to get the base (root) form.
For example, if I'd do:
import spacy

nlp = spacy.load('en_core_web_sm')
z = nlp("deathly lively")
for token in z:
    print(token.lemma_)
It would return:
>>> deathly lively
instead of:
>>> death life
Does anyone have any ideas?
Any answer is appreciated.
From what I've seen so far, spaCy is not super-great at doing what you want it to do. Instead, I am using a third-party library called pyinflect, which is intended to be used as an extension to spaCy.
While it isn't perfect, I think it will work better than your current approach; a short sketch follows below.
I'm also considering another third-party library called inflect, which might be worth checking out as well.
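For illustration, a minimal sketch of how pyinflect attaches to spaCy tokens; whether it actually returns a noun form for words like "deathly" depends on its underlying word lists, so treat the output as something to verify rather than a guarantee:

import spacy
import pyinflect  # importing registers the ._.inflect extension on spaCy tokens

nlp = spacy.load("en_core_web_sm")
doc = nlp("deathly lively")

for token in doc:
    # ask pyinflect for a singular-noun ('NN') form of each token, if it knows one
    print(token.text, "->", token._.inflect("NN"))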

NLTK : combining stanford tagger and personal tagger

The goal of my project is to answer queries such as, for example:
"I am looking for American women between 20 and 30 years old who work in Google"
I then have to process the query and to look into a DB to find the answer.
For this, I would need to combine the Stanford 3-class NERTagger and my own tagger. Indeed, my NER tagger can tag ages, nationalities and gender. But I need the Stanford tagger to tag organizations as I don't have any training file for this.
Right now, I have code like this:
def __init__(self, q):
    self.userQuery = q

def get_tagged_tokens(self):
    st = NERTagger(r'C:\stanford-ner-2015-01-30\my-ner-model.ser.gz', r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    result = st.tag(self.userQuery.split())[0]
    return result
And I would like to have something like this:
def get_tagged_tokens(self):
    st = NERTagger(r'C:\stanford-ner-2015-01-30\my-ner-model.ser.gz', r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    st_def = NERTagger(r'C:\stanford-ner-2015-01-30\classifiers\english.all.3class.distsim.crf.ser.gz', r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    tagger = BackoffTagger([st, st_def])
    result = tagger.tag(self.userQuery.split())[0]
    return result
This would mean that the tagger first uses my tagger and then the stanford one to tag untagged words.
Is it possible to combine my model with the Stanford model just to tag organizations? If yes, what is the best way to perform this?
Thank you!
The new NERClassifierCombiner with Stanford CoreNLP 3.5.2 or the new Stanford NER 3.5.2 has added command line functionality that makes it easy to get this effect with NLTK.
When you provide a list of serialized classifiers, NERClassifierCombiner runs them in sequence: once a tagger has tagged a token, later taggers will not re-tag it. Note that in my demo code I provide two classifiers as an example; they are run in the order you list them. If I recall correctly, you can put as many as 10 in there!
First, make sure that you have the latest copy of Stanford CoreNLP 3.5.2 or Stanford NER 3.5.2 , so that you have the right .jar file with this new functionality.
Second, make sure your custom NER model was built with Stanford CoreNLP or Stanford NER, this won't work otherwise! It should be ok if you used older versions.
Third, I have provided some sample code that should work, the main gist of this is to subclass NERTagger:
If people would like I could look into pushing this to NLTK so it is in there by default!
Here is some sample code (it is a little hacky since I was just rushing this out the door, for instance in NERComboTagger's constructor there is no point to the first argument being classifier_path1, but the code would crash if I didn't put a valid file there):
#!/usr/bin/python
from nltk.tag.stanford import NERTagger

class NERComboTagger(NERTagger):

    def __init__(self, *args, **kwargs):
        self.stanford_ner_models = kwargs['stanford_ner_models']
        kwargs.pop("stanford_ner_models")
        super(NERComboTagger, self).__init__(*args, **kwargs)

    @property
    def _cmd(self):
        return ['edu.stanford.nlp.ie.NERClassifierCombiner',
                '-ner.model',
                self.stanford_ner_models,
                '-textFile',
                self._input_file_path,
                '-outputFormat',
                self._FORMAT,
                '-tokenizerFactory',
                'edu.stanford.nlp.process.WhitespaceTokenizer',
                '-tokenizerOptions',
                '"tokenizeNLs=false"']

classifier_path1 = "classifiers/english.conll.4class.distsim.crf.ser.gz"
classifier_path2 = "classifiers/english.muc.7class.distsim.crf.ser.gz"
ner_jar_path = "stanford-ner.jar"

st = NERComboTagger(classifier_path1, ner_jar_path,
                    stanford_ner_models=classifier_path1 + "," + classifier_path2)
print(st.tag("Barack Obama is from Hawaii .".split(" ")))
Note that the major change in the subclass is what is returned by _cmd.
Also note that I ran this in the unzipped folder stanford-ner-2015-04-20, so the paths are relative to that.
I get this output:
[('Barack', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('from', 'O'), ('Hawaii', 'LOCATION'), ('.', 'O')]
Here is a link to the Stanford NER page:
http://nlp.stanford.edu/software/CRF-NER.shtml
Please let me know if you need any more help or if there are any errors in my code, I may have made a mistake while transcribing, but it works on my laptop!

POS tagging in German

I am using NLTK to extract nouns from a text-string starting with the following command:
tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string)))
It works fine in English. Is there an easy way to make it work for German as well?
(I have no experience with natural language processing, but I managed to use the Python nltk library, which is great so far.)
Natural language software does its magic by leveraging corpora and the statistics they provide. You'll need to tell nltk about some German corpus to help it tokenize German correctly. I believe the EUROPARL corpus might help get you going.
See nltk.corpus.europarl_raw and this answer for example configuration.
Also, consider tagging this question with "nlp".
The Pattern library includes a function for parsing German sentences and the result includes the part-of-speech tags. The following is copied from their documentation:
from pattern.de import parse, split

s = parse('Die Katze liegt auf der Matte.')
s = split(s)
print(s.sentences[0])

>>> Sentence('Die/DT/B-NP/O Katze/NN/I-NP/O liegt/VB/B-VP/O'
    'auf/IN/B-PP/B-PNP der/DT/B-NP/I-PNP Matte/NN/I-NP/I-PNP ././O/O')
Update: another option is spaCy; there is a quick example in this blog article:
import spacy
nlp = spacy.load('de')
doc = nlp(u'Ich bin ein Berliner.')
# show universal pos tags
print(' '.join('{word}/{tag}'.format(word=t.orth_, tag=t.pos_) for t in doc))
# output: Ich/PRON bin/AUX ein/DET Berliner/NOUN ./PUNCT
Part-of-speech (POS) tagging is very specific to a particular [natural] language. NLTK includes many different taggers, which use distinct techniques to infer the tag of a given token in a given context. Most (but not all) of these taggers use a statistical model of some sort as the main or sole device to "do the trick". Such taggers require "training data" from which to build this statistical representation of the language, and that training data comes in the form of corpora.
The NLTK "distribution" itself includes many of these corpora, as well as a set of "corpus readers" which provide an API to read different types of corpora. I don't know offhand whether the NLTK distribution includes any German corpus. You can, however, locate some free corpora, which you'll then need to convert to a format that satisfies the proper NLTK corpus reader; you can then use this to train a POS tagger for the German language.
You can even create your own corpus, but that is a hell of a painstaking job; if you work in a university, you have to find ways of bribing and otherwise coercing students to do that for you ;-)
Possibly you can use the Stanford POS tagger. Below is a recipe I wrote. There are Python recipes for German NLP that I've compiled, and you can access them at http://htmlpreview.github.io/?https://github.com/alvations/DLTK/blob/master/docs/index.html
# -*- coding: utf-8 -*-
import os, glob, codecs

def installStanfordTag():
    # Download and unzip the Stanford POS tagger if it isn't already present.
    if not os.path.exists('stanford-postagger-full-2013-06-20'):
        os.system('wget http://nlp.stanford.edu/software/stanford-postagger-full-2013-06-20.zip')
        os.system('unzip stanford-postagger-full-2013-06-20.zip')
    return

def tag(infile):
    cmd = "./stanford-postagger.sh " + models[m] + " " + infile
    tagout = os.popen(cmd).readlines()
    return [i.strip() for i in tagout]

def taglinebyline(sents):
    tagged = []
    for ss in sents:
        os.popen("echo '''" + ss + "''' > stanfordtemp.txt")
        tagged.append(tag('stanfordtemp.txt')[0])
    return tagged

installStanfordTag()
stagdir = './stanford-postagger-full-2013-06-20/'
models = {'fast': 'models/german-fast.tagger',
          'dewac': 'models/german-dewac.tagger',
          'hgc': 'models/german-hgc.tagger'}
os.chdir(stagdir)
print(os.getcwd())

m = 'fast'  # It's best to use the fast German tagger if your data is small.
sentences = ['Ich bin schwanger .', 'Ich bin wieder schwanger .', 'Ich verstehe nur Bahnhof .']
tagged_sents = taglinebyline(sentences)  # Call the Stanford tagger
for sent in tagged_sents:
    print(sent)
I have written a blog post about how to convert the German annotated TIGER corpus in order to use it with NLTK. Have a look at it here; a rough sketch of the approach follows below.
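As a rough sketch of the idea (the TIGER file name, the CoNLL-09 column layout and the simple backoff chain below are assumptions; adjust them to the release you download, and see the blog post for a proper classifier-based tagger):

import random
import nltk
from nltk.corpus.reader import ConllCorpusReader

# Read the TIGER release in CoNLL-09 format, keeping only the word and POS columns.
corpus = ConllCorpusReader('.', 'tiger_release_aug07.corrected.16012013.conll09',
                           ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                           encoding='utf-8')

tagged_sents = list(corpus.tagged_sents())
random.shuffle(tagged_sents)
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Simple backoff chain: bigram -> unigram -> default tag.
default = nltk.DefaultTagger('NN')
unigram = nltk.UnigramTagger(train_sents, backoff=default)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

print(bigram.evaluate(test_sents))
print(bigram.tag('Das ist ein einfacher Test .'.split()))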
It seems a little late to answer the question, but it might be helpful for anyone who finds it by googling, like I did. So I'd like to share what I found out.
The HannoverTagger might be a useful tool for this task; a minimal usage sketch follows below.
You can find tutorials here and here (the second one is in German).
The tagger appears to use the STTS tagset, in case you need a complete list of all tags.
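For illustration, a minimal usage sketch based on the package's documentation (the tagger is distributed as the HanTa package; the model file name morphmodel_ger.pgz and the example sentence are assumptions):

import nltk
from HanTa import HanoverTagger

# Load the pretrained German model shipped with the package.
tagger = HanoverTagger.HanoverTagger('morphmodel_ger.pgz')

sentence = 'Die Katze liegt auf der Matte.'
words = nltk.word_tokenize(sentence, language='german')

# tag_sent returns (word, lemma, STTS tag) triples.
for word, lemma, pos in tagger.tag_sent(words):
    print(word, lemma, pos)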
