I have a list of words and would like to keep only nouns.
This is not a duplicate of Extracting all Nouns from a text file using nltk
In the linked question a piece of text is processed. The accepted answer proposes a tagger. I'm aware of the different options for tagging text (nltk, textblob, spacy), but I can't use them, since my data doesn't consist of sentences. I only have a list of individual words:
would
research
part
technologies
size
articles
analyzes
line
nltk has a wide selection of corpora. I found verbnet with a comprehensive list of verbs. But so far I haven't seen anything similar for nouns. Is there something like a dictionary where I can look up whether a word is a noun, verb, adjective, etc.?
This could probably be done by some online service. Microsoft Translate, for example, returns a lot of information in its responses: https://learn.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-dictionary-lookup?tabs=curl
But this is a paid service. I would prefer a python package.
Regarding the ambiguity of words: Ideally I would like a dictionary that can tell me all the functions a word can have. "fish" for example is both a noun and a verb. "eat" is only a verb, "dog" is only a noun. I'm aware that this is not an exact science. A working solution would simply remove all words that can't be nouns.
Tried using wordnet?
from nltk.corpus import wordnet

words = ["would", "research", "part", "technologies", "size", "articles", "analyzes", "line"]
for w in words:
    syns = wordnet.synsets(w)
    print(w, syns[0].lexname().split('.')[0] if syns else None)
You should see:
would None
research noun
part noun
technologies noun
size noun
articles noun
analyzes verb
line noun
You can run a POS tagger on individual fragments; it will have lower accuracy, but I suppose that's already a given.
Ideally, find a POS tagger which reveals every possible reading for possible syntactic disambiguation later on in the processing pipeline. This will basically just pick out all the possible readings from the lexicon (perhaps with a probability) and let you take it from there.
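A minimal sketch of that lexicon-lookup idea, using WordNet's synsets to collect every part of speech a word can have and keeping only the words that can be nouns (the helper name possible_pos is just illustrative):

from nltk.corpus import wordnet

def possible_pos(word):
    # Every part of speech WordNet lists for this word ('n', 'v', 'a', 's', 'r').
    return {syn.pos() for syn in wordnet.synsets(word)}

words = ["would", "fish", "eat", "dog", "analyzes"]
nouns = [w for w in words if wordnet.NOUN in possible_pos(w)]
print(nouns)  # -> ['fish', 'dog'] ("would" has no WordNet entry at all)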
Even if you use a dictionary, you will always have to deal with ambiguity. For example, depending on the context, the same word can be a noun or a verb; take the word research:
The government will invest in research.
The goal is to research new techniques of POS-tagging.
Most dictionaries will have more than one entry for research, for example:
research as a noun
research as a verb
Where do these words come from? Can you maybe POS-tag them within the context where they occur?
As @Triplee and @DavidBatista pointed out, it is really complicated to find out whether a word is a noun or a verb by itself, because in most languages the syntactic role of a word depends on its context.
Words are just representations of meanings. Because of that, I'd like to add another suggestion that might fit what you mean: instead of trying to find out whether a word is a noun or a verb, try to find out whether a Concept is an Object or an Action. This still has the problem of ambiguity, because a concept can carry both the Action and the Object form.
However, you can stick to Concepts that only have object properties (such as TypeOf, HasAsPart, IsPartOf, etc.) or Concepts that have both object and action properties (action properties such as Subevents, Effects, Requires).
A good tool for concept searching is ConceptNet. It provides a web API to search for concepts in its network by keyword (it is based on Wikipedia and many other sites and is very complete for the English language), is open, and also points to synonyms in other languages (which are tagged with their common POS, so you could average the POS of the synonyms to try to find out whether the word is an object [noun-like] or an action [verb-like]).
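A rough sketch of that lookup, assuming the public api.conceptnet.io endpoint and its JSON "edges" format (field names may vary across API versions; the helper name is illustrative):

import requests

def conceptnet_relations(word, lang="en"):
    # Collect the relation labels ConceptNet lists for a concept, e.g. IsA, PartOf, HasSubevent.
    resp = requests.get(f"http://api.conceptnet.io/c/{lang}/{word}")
    resp.raise_for_status()
    return {edge["rel"]["label"] for edge in resp.json().get("edges", [])}

# Object-like relations (IsA, PartOf, HasA, ...) suggest a noun-like concept;
# action-like relations (HasSubevent, Causes, ...) suggest a verb-like one.
print(conceptnet_relations("fish"))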
Related
I am looking for ideas/thoughts on the following problem:
I am working with food ingredient data such as: milk, sugar, eggs, flour, may contain nuts
From such a piece of text I want to be able to identify and extract phrases like may contain nuts, to preprocess them separately.
These kinds of phrases can change quite a lot in terms of length and content. I thought of using NER taggers, but I don't know if they will do the job correctly as they are mainly used for identifying single-word entities...
Any ideas on what to use as a phrase-entity-recognition system? Also which package would you use? Cheers
IMHO NER (or model-based entity extraction in general) alone is a poor choice of methodology for this particular problem as it requires LOTS of manual annotation to do it right. Instead I suggest using Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) with phrasing (https://radimrehurek.com/gensim/models/phrases.html).
The idea is to have an unsupervised model containing phrases and their similarities which can then be queried using some seed words to list all possible ingredients (e.g. "cat" produces similar words like "dog" or "rat"). The next step would be either to create dictionaries containing the ingredient words & phrases or to try clustering the vocabulary of the model using cosine similarity between each word/phrase pair.
Now if you want to take things further you can always match your created dictionaries/clusters back to the corpus the W2V model was trained on and then train a custom entity recognition model using those matches as you now have annotated examples.
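A minimal sketch of that pipeline with gensim (assuming gensim 4.x; the toy corpus and the parameters are just placeholders for your real data):

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Toy corpus of tokenised ingredient lines; replace with your own data.
sentences = [
    ["milk", "sugar", "eggs", "flour", "may", "contain", "nuts"],
    ["wheat", "flour", "salt", "may", "contain", "nuts"],
    ["cocoa", "butter", "may", "contain", "traces", "of", "peanuts"],
]

# Learn frequent bigrams such as "may_contain"; apply Phrases twice to get longer
# units like "may_contain_nuts".
phraser = Phraser(Phrases(sentences, min_count=1, threshold=1))
phrased = [phraser[s] for s in sentences]

# Train word vectors on the phrased corpus, then query with seed words.
model = Word2Vec(phrased, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("flour", topn=3))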
I believe this is a Multiword-Expression problem.
There are a few ways you can try to solve this:
Build a named entity recognition model (NER)
Search with Regex for a fixed set of known phrases
Chunking tokens with POS tags
Find collocations of tokens
Let's look at each of these
Build a named entity recognition model (NER)
Named Entity Recognition labels known spans of tokens as an entity type.
For each input token you have to label it as part of a known named entity.
Eddy N PERSON
Bonte N PERSON
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N ORG
. Punc O
This is costly and requires a lot of time for labelling.
It is probably not a good choice for your task.
Search with Regex
This is not a bad idea: using some known phrases you could easily search input texts, with word boundaries to avoid partial-word matches.
import re

text = "milk, sugar, eggs, flour, may contain nuts"
re.findall(r"\bmay contain nuts\b", text)
This would require you to know all the phrases you want to search for up front, which might not be possible.
Chunking tokens with POS tags
This could be a good intermediate step but could give many false positives.
You could do this by knowing the sequences of POS tags you expect, for example:
may MD
contain VB
nuts NNS
Then you could use chunking with the known tag sequence (MD, VB, NNS).
The problem is that you may not know these sequences, and you would have to capture many use cases. It will also capture many sequences which you won't want (false positives).
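A short sketch of that chunking step with nltk's RegexpParser (assumes the punkt and averaged_perceptron_tagger resources have been downloaded; the WARN label and the toy sentence are just for illustration):

import nltk

# Chunk a modal + verb + plural-noun sequence, i.e. the (MD, VB, NNS) pattern above.
grammar = "WARN: {<MD><VB><NNS>}"
chunker = nltk.RegexpParser(grammar)

tokens = nltk.word_tokenize("milk, sugar, eggs, flour, may contain nuts")
tree = chunker.parse(nltk.pos_tag(tokens))

for subtree in tree.subtrees(filter=lambda t: t.label() == "WARN"):
    print(" ".join(word for word, tag in subtree.leaves()))  # may contain nuts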
Find collocations of tokens
This is probably the best way, as it seems you are looking for highly common sequences of words (tokens) in a corpus.
You can do this using:
Word2Vec Phrases
NLTK Collocations
Both do the same thing: they look for statistically common sequences of tokens which occur in a corpus.
These can then be used to extract the same collocation phrases from new texts.
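A small sketch with NLTK's collocation finders (the toy token list stands in for your real corpus):

import nltk
from nltk.collocations import TrigramCollocationFinder

# Toy corpus; in practice, concatenate the tokens of all your ingredient lines.
tokens = ("milk sugar eggs flour may contain nuts "
          "wheat flour salt may contain nuts").split()

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)

# Rank trigrams by pointwise mutual information and keep the top candidates.
for trigram in finder.nbest(trigram_measures.pmi, 3):
    print(" ".join(trigram))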
It looks like your ingredient list is easy to split into individual items. In that case you don't really need a sequence tagger; I wouldn't treat this problem as phrase extraction or NER. What I would do is train a classifier on the different items in the list to label them as "food" or "non-food". You should be able to start with rules and then train a basic classifier using anything, really.
Before training a model, an even simpler step would be to run each list item through a PoS tagger (say spaCy); if there's a verb, you can guess that it's not a food item.
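A quick sketch of that heuristic with spaCy (assumes the en_core_web_sm model has been installed with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")

items = ["milk", "sugar", "eggs", "flour", "may contain nuts"]
for item in items:
    # If the item contains a verb, it is probably a warning phrase rather than a food.
    has_verb = any(tok.pos_ in ("VERB", "AUX") for tok in nlp(item))
    print(item, "->", "not food" if has_verb else "food?")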
Hello, I'm looking for a solution to my issue:
I want to find a list of similar words in French and English.
For example:
name could be: first name, last name, nom, prénom, username...
Postal address could be: city, country, street, ville, pays, code postale...
The other answer, and comments, describe how to get synonyms, but I think you want more than that?
I can suggest two broad approaches: WordNet and word embeddings.
Using nltk and wordnet, you want to explore the adjacent graph nodes. See http://www.nltk.org/howto/wordnet.html for an overview of the functions available. I'd suggest that once you've found your start word in Wordnet, follow all its relations, but also go up to the hypernym, and do the same there.
Finding the start word is not always easy; searching for "postal address" doesn't find anything:
http://wordnetweb.princeton.edu/perl/webwn?s=Postal+address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
Instead it seems I have to use "address": http://wordnetweb.princeton.edu/perl/webwn?s=address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
and then decide which of those is the correct sense here. Then try clicking the hypernym, hyponym, sister term, etc.
To be honest, none of those feels quite right.
Open Multilingual WordNet tries to link different languages. http://compling.hss.ntu.edu.sg/omw/ So you could take your English WordNet code, and move to the French WordNet with it, or vice versa.
The other approach is to use word embeddings. You find the, say, 300-dimensional vector of your source word, and then hunt for the nearest words in that vector space. This will return words that are used in similar contexts, so they could be similar in meaning, or just similar syntactically.
Spacy has a good implementation, see https://spacy.io/usage/spacy-101#vectors-similarity and https://spacy.io/usage/vectors-similarity
Regarding English and French, normally you would work in the two languages independently. But if you search for "multilingual word embeddings" you will find some papers and projects where the vector stays the same for the same concept in different languages.
Note: the API is geared towards telling you how similar two words are, not towards finding similar words. To find similar words you need to take your vector and compare it with every other word vector, which is O(N) in the size of the vocabulary. So you might want to do this offline, and build your own "synonyms-and-similar" dictionary for each word of interest.
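A sketch of that offline lookup, assuming spaCy's Vectors.most_similar helper and a model that ships with word vectors (e.g. en_core_web_md; the most_similar function name here is just illustrative):

import spacy

nlp = spacy.load("en_core_web_md")

def most_similar(word, topn=10):
    # Query the model's vector table for the nearest neighbours of the word's vector.
    query = nlp.vocab[word].vector.reshape(1, -1)
    keys, _, scores = nlp.vocab.vectors.most_similar(query, n=topn)
    return [nlp.vocab.strings[key] for key in keys[0]]

print(most_similar("address"))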
from PyDictionary import PyDictionary

dictionary = PyDictionary()
answer = dictionary.synonym(word)
word is the word for which you are finding the synonyms.
I am working on NLP with Python and my next step is to gather a huge amount of data regarding specific topics in English.
For example : all words that can define a "Department" say "Accounts".
So can anyone tell me how I can gather such data (if possible, through an API)?
NLTK WordNet is a great framework for these kinds of problems. Here is a brief documentation:
http://www.nltk.org/howto/wordnet.html It uses objects like "synset" which give you words with common meanings. There are also ways to get a numerical score for the similarity of two words. Lemmas will give you words with similar root meanings.
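For instance, a small sketch of those lookups with NLTK's WordNet interface (the choice of "accounts" and "department" simply mirrors the example above):

from nltk.corpus import wordnet

# Synsets group words that share a meaning; lemmas are their surface forms.
account = wordnet.synsets("accounts", pos=wordnet.NOUN)[0]
print(account.definition())
print([lemma.name() for lemma in account.lemmas()])

# A numerical similarity score between two senses (path similarity, between 0 and 1).
department = wordnet.synsets("department", pos=wordnet.NOUN)[0]
print(account.path_similarity(department))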
If you are looking more for related words (e.g. "spaghetti" --> "pasta", "ravioli", "Italy"), the Datamuse API is probably better:
https://www.datamuse.com/api/
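A quick sketch of a Datamuse query, assuming its words endpoint and the "ml" ("means like") parameter (the related_words helper is illustrative):

import requests

def related_words(word, max_results=10):
    # Ask Datamuse for words with a related meaning ("ml" = "means like").
    resp = requests.get("https://api.datamuse.com/words",
                        params={"ml": word, "max": max_results})
    resp.raise_for_status()
    return [entry["word"] for entry in resp.json()]

print(related_words("spaghetti"))  # e.g. pasta, noodles, ravioli, ...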
I'm trying to find a package/library which enables me to determine to which noun a verb refers (in Python):
The man was walking down the street
That this will give me a result like walking refers to man.
I tried to use nltk, however as far as I could find out I can only tag the words as nouns/verbs etc. but cannot infer any references from this.
Question
Are there any packages which are capable to do the above?
I think what you want to do is explore the syntactic dependencies among words. For that, you need to parse your text with a syntactic parser.
Since you want to know the references between nouns and verbs you will need to do two things.
You need to get the part-of-speech tags associated with each word (i.e. the morphology associated with each word, whether it's an ADJ, DET, VERB, NOUN, etc.), then you want to select the ones tagged as verbs and nouns.
Then, you want to look at which other words they connect with; mostly you will want to explore the 'nsubj' dependency.
spaCy is an NLP library for Python that performs syntactic parsing, and it also has an online demo if you want to try it out, check:
https://demos.explosion.ai/displacy/
Here is the output for the example you gave: the parse links walking to man through the nsubj relation.
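In code, that lookup might look something like this sketch (assuming the en_core_web_sm model has been downloaded):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The man was walking down the street")

for token in doc:
    # A nominal subject attached to a verb gives you the noun the verb refers to.
    if token.dep_ == "nsubj" and token.head.pos_ in ("VERB", "AUX"):
        print(token.head.text, "refers to", token.text)  # walking refers to man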
If I wanted to make a NLP Toolkit like NLTK, which features would I implement first after tokenisation and normalisation. POS Tagging or Lemmatisation?
Part of speech is important for lemmatisation to work, as words can have different meanings (and different lemmas) depending on their part of speech. Using this information, lemmatisation will return the base form or lemma. So it would be better if the POS tagging implementation is done first.
The main idea behind lemmatisation is to group different inflected forms of a word into one. For example, go, going, gone and went will become just one - go. But to derive this, lemmatisation would have to know the context of a word - whether the word is a noun or verb etc.
So, the lemmatisation function can take the word and the part of speech as input and return the lemma after processing the information.
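NLTK's WordNetLemmatizer is a concrete example of such a function; without the POS hint it falls back to treating the word as a noun:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("went"))            # 'went' (treated as a noun, left unchanged)
print(lemmatizer.lemmatize("went", pos="v"))   # 'go'
print(lemmatizer.lemmatize("going", pos="v"))  # 'go'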
Sure, make the POS tagger first. If you do lemmatisation first you could lose the best possible classification of words when doing POS tagging, especially in languages where ambiguity is commonplace, as it is in Portuguese.