I'm trying to find a package/library which enables me to determine to which noun a verb refers (in Python):
The man was walking down the street
This should give me a result like: "walking" refers to "man".
I tried to use nltk, but as far as I can tell it only lets me tag the words as nouns/verbs etc.; it cannot infer any references from this.
Question
Are there any packages that are capable of doing the above?
I think what you want to do is explore the syntactic dependencies among words. For that, you need to parse your text with a syntactic parser.
Since you want to know the references between nouns and verbs, you will need to do two things.
First, you need to get the part-of-speech tag associated with each word (i.e. the morphology of each word, whether it's an ADJ, DET, VERB, NOUN, etc.); then you want to select the ones tagged as verbs and nouns.
Then, you want to look at which other words they connect with; I think mostly you will want to explore the 'nsubj' dependency.
spaCy is an NLP library for Python that performs syntactic parsing, and it also has an online demo. If you want to try it out, check:
https://demos.explosion.ai/displacy/
For the example you gave, the parse links "walking" to "man" through the nsubj dependency.
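If you want the same information programmatically rather than from the demo, here is a minimal sketch with spaCy (assuming the en_core_web_sm model is installed; the model name and exact labels may differ between versions):

import spacy

# Assumes the small English model is available: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The man was walking down the street")

for token in doc:
    if token.pos_ == "VERB":
        # the nsubj dependency links a verb to its subject noun
        for child in token.children:
            if child.dep_ == "nsubj":
                print(child.text, "is the subject of", token.text)

For this sentence it should print that "man" is the subject of "walking", which is the reference you are after.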
Hello, I'm looking for a solution to my issue:
I want to find a list of similar words in French and English.
For example:
name could be: first name, last name, nom, prénom, username...
Postal address could be: city, country, street, ville, pays, code postale...
The other answer, and comments, describe how to get synonyms, but I think you want more than that?
I can suggest two broad approaches: WordNet and word embeddings.
Using nltk and wordnet, you want to explore the adjacent graph nodes. See http://www.nltk.org/howto/wordnet.html for an overview of the functions available. I'd suggest that once you've found your start word in Wordnet, follow all its relations, but also go up to the hypernym, and do the same there.
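As a rough sketch of that exploration with nltk (assuming the wordnet data has been downloaded; the start word "address" is just an example, and you still have to pick the right sense yourself):

from nltk.corpus import wordnet as wn

# nltk.download('wordnet')  # uncomment on first use

# Pick one sense of the start word; in practice you must choose the sense you mean
synset = wn.synsets("address")[0]

# Synonyms inside the same synset
print(synset.lemma_names())

# Go up to the hypernym(s), then look at their other hyponyms (the sister terms)
for hyper in synset.hypernyms():
    print("hypernym:", hyper.lemma_names())
    for sister in hyper.hyponyms():
        print("  sister term:", sister.lemma_names())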
Finding the start word is not always easy; searching for "Postal address", for instance, doesn't turn up an entry:
http://wordnetweb.princeton.edu/perl/webwn?s=Postal+address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
Instead it seems I have to use "address": http://wordnetweb.princeton.edu/perl/webwn?s=address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
and then decide which of those is the correct sense here. Then try clicking the hypernym, hyponym, sister term, etc.
To be honest, none of those feels quite right.
Open Multilingual WordNet tries to link different languages. http://compling.hss.ntu.edu.sg/omw/ So you could take your English WordNet code, and move to the French WordNet with it, or vice versa.
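NLTK's WordNet interface also exposes Open Multilingual Wordnet, so a small sketch of hopping between languages could look like this (the French word and the 'fra' language code are assumptions; coverage of the French wordnet is far from complete, and you need the omw data downloaded):

from nltk.corpus import wordnet as wn

# nltk.download('omw-1.4')  # uncomment on first use, in addition to 'wordnet'

# Look up a French word and read off the English and French lemmas of each sense
for synset in wn.synsets("adresse", lang="fra"):
    print(synset.name(), synset.lemma_names("eng"), synset.lemma_names("fra"))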
The other approach is to use word embeddings. You find the, say, 300-dimensional vector of your source word, and then hunt for the nearest words in that vector space. This will return words that are used in similar contexts, so they could be similar in meaning, or just similar syntactically.
Spacy has a good implementation, see https://spacy.io/usage/spacy-101#vectors-similarity and https://spacy.io/usage/vectors-similarity
Regarding English and French, normally you would work in the two languages independently. But if you search for "multilingual word embeddings" you will find some papers and projects where the vector stays the same for the same concept in different languages.
Note: the API is geared towards telling you how similar two words are, not towards finding similar words. To find similar words you need to take your vector and compare it with every other word vector, which is O(N) in the size of the vocabulary. So you might want to do this offline, and build your own "synonyms-and-similar" dictionary for each word of interest.
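A rough sketch of that offline search with spaCy (assuming a spaCy v3 model that ships with vectors, such as en_core_web_md; Vectors.most_similar does the brute-force scan for you):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # a model that includes word vectors

# Query vector for the source word, shaped (1, dims) as most_similar expects
query = np.asarray([nlp.vocab["address"].vector])

# Scan the whole vector table for the 10 nearest neighbours
keys, _, scores = nlp.vocab.vectors.most_similar(query, n=10)
print([nlp.vocab.strings[int(k)] for k in keys[0]])
print(scores[0])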
from PyDictionary import PyDictionary

dictionary = PyDictionary()
word = "name"  # the word for which you are finding the synonyms
answer = dictionary.synonym(word)  # a list of synonyms, or None if the lookup fails
I have a list of words and would like to keep only nouns.
This is not a duplicate of Extracting all Nouns from a text file using nltk
In the linked question a piece of text is processed. The accepted answer proposes a tagger. I'm aware of the different options for tagging text (nltk, textblob, spacy), but I can't use them, since my data doesn't consist of sentences. I only have a list of individual words:
would
research
part
technologies
size
articles
analyzes
line
nltk has a wide selection of corpora. I found verbnet with a comprehensive list of verbs, but so far I haven't seen anything similar for nouns. Is there something like a dictionary where I can look up whether a word is a noun, verb, adjective, etc.?
This could probably be done by some online service. Microsoft Translator, for example, returns a lot of information in its responses: https://learn.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-dictionary-lookup?tabs=curl
But this is a paid service. I would prefer a Python package.
Regarding the ambiguity of words: ideally I would like a dictionary that can tell me all the functions a word can have. "fish", for example, is both a noun and a verb; "eat" is only a verb, "dog" is only a noun. I'm aware that this is not an exact science. A working solution would simply remove all words that can't be nouns.
Have you tried using WordNet?
from nltk.corpus import wordnet

words = ["would", "research", "part", "technologies", "size", "articles", "analyzes", "line"]
for w in words:
    syns = wordnet.synsets(w)
    print(w, syns[0].lexname().split('.')[0] if syns else None)

You should see:

would None
research noun
part noun
technologies noun
size noun
articles noun
analyzes verb
line noun
You can run a POS tagger on individual fragments; it will have lower accuracy, but I suppose that's already a given.
Ideally, find a POS tagger which reveals every possible reading for possible syntactic disambiguation later on in the processing pipeline. This will basically just pick out all the possible readings from the lexicon (perhaps with a probability) and let you take it from there.
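As a rough sketch with nltk (the resource name in the download comment may differ between NLTK versions):

import nltk

# nltk.download('averaged_perceptron_tagger')  # uncomment on first use

words = ["would", "research", "fish", "dog"]
for w in words:
    # Tag each word as a one-word "sentence"; without context the accuracy is limited
    tag = nltk.pos_tag([w])[0][1]
    print(w, tag)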
Even if you use a dictionary, you will always have to deal with ambiguity; for example, the same word can be a noun or a verb depending on the context. Take the word research:
The government will invest in research.
The goal is to research new techniques of POS-tagging.
Most dictionaries will have more than one definition of research, for example:
research as a noun
research as a verb
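You can see the same ambiguity in WordNet via nltk; a small sketch (assuming the wordnet data is downloaded):

from nltk.corpus import wordnet as wn

# "research" has both noun senses and verb senses
print(wn.synsets("research", pos=wn.NOUN))
print(wn.synsets("research", pos=wn.VERB))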
Where do these words come from? Could you maybe POS-tag them within the context where they occur?
As @Triplee and @DavidBatista pointed out, it is really complicated to find out if a word is a noun or a verb by itself, because in most languages the syntax of a word depends on context.
Words are just representations of meanings. Because of that, I'd like to add another proposition that might fit what you mean - instead of trying to find out if a word is a noun or a verb, try to find out if a Concept is an Object or an Action. This still has the problem of ambiguity, because a concept can carry both the Action and Object forms.
However, you can stick to Concepts that only have object properties (such as TypeOf, HasAsPart, IsPartOf, etc.) or Concepts that have both object and action properties (action properties such as Subevents, Effects, Requires).
A good tool for concept searching is ConceptNet. It provides a web API to search for concepts in its network by keyword (it is built from Wikipedia and many other sites and is very complete for English), is open, and also points to synonyms in other languages (which are tagged with their common POS - you could average the POS of the synonyms to try to find out whether the word is an object [noun-like] or an action [verb-like]).
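A minimal sketch of querying the ConceptNet web API with requests (this uses the public api.conceptnet.io endpoint; the word "skyscraper" and the number of edges shown are just example choices):

import requests

# Look up the English concept "skyscraper" on the public ConceptNet API
data = requests.get("http://api.conceptnet.io/c/en/skyscraper").json()

# Each edge connects two concepts through a relation such as IsA, PartOf, UsedFor, ...
for edge in data["edges"][:10]:
    print(edge["rel"]["label"], ":", edge["start"]["label"], "->", edge["end"]["label"])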
If I wanted to make an NLP toolkit like NLTK, which feature would I implement first after tokenisation and normalisation: POS tagging or lemmatisation?
Part of speech is important for lemmatisation to work, as words can have different meanings depending on their part of speech. Using this information, lemmatisation will return the base form, or lemma. So it would be better if the POS tagging implementation is done first.
The main idea behind lemmatisation is to group different inflected forms of a word into one. For example, go, going, gone and went will become just one - go. But to derive this, lemmatisation would have to know the context of a word - whether the word is a noun or verb etc.
So, the lemmatisation function can take the word and the part of speech as input and return the lemma after processing the information.
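You can see this dependence in NLTK's WordNet lemmatiser, which takes the part of speech as an argument; a small sketch (assuming the wordnet data is downloaded):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# With the default POS (noun) the inflected verb forms are left alone
print(lemmatizer.lemmatize("going"))           # going
# With the verb POS they collapse to the lemma
print(lemmatizer.lemmatize("going", pos="v"))  # go
print(lemmatizer.lemmatize("went", pos="v"))   # go
print(lemmatizer.lemmatize("gone", pos="v"))   # go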
Sure, make the POS tagger first. If you do lemmatisation first, you could lose the best possible classification of words when doing POS tagging, especially in languages where ambiguity is commonplace, as it is in Portuguese.
I am trying to come up with some rules to detect named entities, specifically company or organization names, in text. I think it makes sense to focus on verbs. There are a lot of POS taggers that can easily detect proper nouns; I personally like StanfordPOSTagger. Now, once I have the proper noun, I know that it is a named entity. However, to be certain that it is the name of a company, I need to come up with rules and possibly gazetteers.
I was thinking of focusing on verbs. Is there a set of common verbs that occur frequently around company names?
I could create an annotated corpus and explicitly train a Machine Learning classifier to predict such verbs, but that is a LOT of work. It would be great if someone has already done some research on this.
Additionally, can some other POS tags give clues, not just verbs?
The verbs approach seems the most promising. I've been working on something myself to identify sentient beings from folktales. See more about my approach here: http://www.aaai.org/ocs/index.php/INT/INT7/paper/viewFile/9253/9204
You may still need to do some annotations and training OR use web text and the method below to find the training data.
If you are looking for real companies (i.e. non-fictional), then I'd suggest you just extract referring expressions (i.e. nouns and also multi-word expressions) and then check against an online database (some with easy to use API) like:
https://angel.co/api (startups)
https://data.crunchbase.com/
http://www.metabase.com/
http://www.opencalais.com/ (paid options)
http://wiki.dbpedia.org/ (wikipedia)
Does the Stanford NER system fit this use-case? It already detects organizations, alongside people and other named entity types.
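A rough sketch of calling Stanford NER from NLTK (the model and jar file names below are assumptions; you need to download the Stanford NER distribution separately and point the tagger at wherever you unpacked it):

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# nltk.download('punkt') may also be needed for word_tokenize

# Placeholder paths for the Stanford NER model and jar
st = StanfordNERTagger("english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")

text = "Apple acquired a small startup in London."
tagged = st.tag(word_tokenize(text))

# Keep only the tokens labelled as organizations
companies = [word for word, label in tagged if label == "ORGANIZATION"]
print(companies)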
Is there a method in NLTK to be able to find certain adjective attributes that describe the word? For example, if I typed in the word "Skyscraper", attributes such as 'tall', 'structured', etc. would appear. I'm more so interested in the reverse, where if I type in the word 'tall' then it will list the semantic relations with other words.
I believe the attributes() method in NLTK is meant for this, but it doesn't quite work the way I described above. This is the code I'm using for it:
from nltk.corpus import wordnet as wn

synsets = wn.synsets('skyscraper')
print([str(syns.attributes()) for syns in synsets])

I've tried using the part_meronyms and attributes methods, but these don't always return the adjective attributes of a word. I know of other Python tools that would allow me to do this, but I would prefer to use only NLTK for now.
Using purely NLTK, you can achieve this as a two-step process, with your own functions.
Basic Idea
Step 1. Find all meaningful collocations for your target word ("skyscraper" or "tall")
Step 2. For the adjectives identified in those collocations that are of interest to you, parse the POS to get the semantic relations.
For Step 1, this SO question on scoring bigrams has defs that are very relevant. You'll have to tweak the BigramAssocMeasures to your problem. (It uses the Brown corpus, but you can use many others.)
For Step 2, you could use something like pos_tag() or even Tree.parse() to get the associations you are looking for with your target adjective.
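A small sketch of Step 1 with nltk's collocation tools (assuming the Brown corpus has been downloaded; the target word "tall", the frequency cutoff, and the PMI scorer are just example choices):

import nltk
from nltk.corpus import brown
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download('brown')  # uncomment on first use

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(w.lower() for w in brown.words())

# Keep only bigrams containing the target word that occur at least 3 times
target = "tall"
finder.apply_ngram_filter(lambda w1, w2: target not in (w1, w2))
finder.apply_freq_filter(3)

# Rank the surviving bigrams by pointwise mutual information
for bigram in finder.nbest(bigram_measures.pmi, 15):
    print(bigram)

You can then pos_tag() these collocates (Step 2) and keep only the ones tagged as adjectives.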
For a simpler, alternative approach, this link has examples of text.similar() that should be relevant.
Hope that helps.