For example...
Chicken is an animal.
Burrito is a food.
WordNet allows you to do "is-a"... the hierarchy feature.
However, how do I know when to stop travelling up the tree? I want a LEVEL.
That is consistent.
For example, if presented with a bunch of words, I want WordNet to categorize all of them, but at a certain level, so it doesn't go too far up. Categorizing "burrito" as a "thing" is too broad, yet "mexican wrapped food" is too specific. I want to go up or down the hierarchy until I reach the right LEVEL.
WordNet is a lexicon rather than an ontology, so 'levels' don't really apply.
There is SUMO, which is an upper ontology which relates to WordNet if you want a directed lattice instead of a network.
For some domains, SUMO's mid-level ontology is probably where you want to look, but I'm not sure it has 'mexican wrapped food', as most of its topics are scientific or engineering.
WordNet's hierarchy is
beef burrito < burrito < dish/2 < victuals < food < substance < entity.
Entity is a top-level concept, so if you stop one below substance you'll get burrito isa food. You can calculate a level based on that, but it won't necessarily be as consistent as SUMO; alternatively, you can generate your own set of useful mid-level concepts to terminate at. There is no 'mexican wrapped food' step in WordNet.
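A minimal sketch of the "calculate a level" idea, assuming the hypernym chain has already been looked up. The chain is hard-coded from the one quoted in this answer, and `category_at_depth` is a hypothetical helper, not part of WordNet:

```python
# Hypernym chain quoted above, ordered from most specific to the root.
CHAIN = ["beef burrito", "burrito", "dish", "victuals", "food", "substance", "entity"]

def category_at_depth(chain, depth_from_root):
    """Return the concept `depth_from_root` steps below the root.

    depth_from_root=0 is the root itself ('entity'). If the chain is
    shorter than requested, fall back to the most specific concept.
    """
    root_first = list(reversed(chain))
    index = min(depth_from_root, len(root_first) - 1)
    return root_first[index]

# Stopping one below 'substance' (two below 'entity') yields 'food'.
print(category_at_depth(CHAIN, 2))
```

Chains for different words have different lengths, so a fixed depth will not be equally specific everywhere, which is the consistency caveat mentioned above.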
[Please credit Pete Kirkham, he first came with the reference to SUMO which may well answer the question asked by Alex, the OP]
(I'm just providing a complement of information here; I started in a comment field but soon ran out of space and layout capabilities...)
Alex: Most of SUMO is science or engineering? It does not contain every-day words like foods, people, cars, jobs, etc?
Pete K: SUMO is an upper ontology. The mid-level ontologies (where you would find concepts between 'thing' and 'beef burrito') listed on the page don't include food, but reflect the sorts of organisations which fund the project. There is a mid-level ontology for people. There's also one for industries (and hence jobs), including food suppliers, but no mention of burritos if you grep it.
My two cents
100% of WordNet (3.0, i.e. the latest, as well as older versions) is mapped to SUMO, and that may be just what Alex needs. The mid-level ontologies associated with SUMO (or rather with MILO) are effectively in specific domains and do not, at this time, include Foodstuff. But since WordNet does include all (well, many) of these everyday things, you do not need to leverage any formal ontology "under" SUMO; instead, use SUMO's WordNet mapping, possibly in addition to WordNet itself, which, again, is not an ontology, but whose informal and loose "hierarchy" may also help.
Some difficulty may arise, however, from two areas (and then some ;-) ):
The SUMO ontology's "level" may not be the level you'd have in mind for your particular application. For example, while "Burrito" brings "Food", a top-level entity in SUMO, "Chicken" brings, well, "Chicken", which only through a long chain finds "Animal" (specifically: Chicken->Poultry->Bird->Warm_Blooded_Vertebrate->Vertebrate->Animal).
WordNet's coverage and metadata are impressive, but with regard to the mid-level concepts it can be a bit inconsistent. For example, "our" Burrito's hypernym is appropriately "Dish", which groups it with circa 140 food dishes, including generics such as "Soup" or "Casserole" as well as "Chicken Marengo" (but omitting, say, "Chicken Cacciatore").
My point, in bringing up these issues, is not to criticize WordNet or SUMO and its related ontologies, but rather to illustrate simply some of the challenges associated with building ontology, particularly at the mid-level.
Regardless of some possible flaws and lackings of a solution based on SUMO and WordNet, a pragmatic use of these frameworks may well "fit the bill" (85% of the time)
In order to get levels, you need to predefine the content of each level. An ontology often defines these as the immediate IS_A children of a specific concept, but if that is absent, you need to develop a method for that yourself.
The next step is to put a priority on each concept, in case you want to present only one category for each word. The priority can be done in multiple ways, for instance as the count of IS_A relations between the category and the word, or manually selected priorities for each category. For each word, you can then pick the category with the highest priority. For instance, you may want meat to be "food" rather than chemical substance.
You may also want to pick some words, that change priority if they are in the path. For instance, if you want some chemicals which are also food, to be announced as chemicals, but others should still be food.
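The priority scheme described above could be sketched roughly as follows. The IS_A graph, the priorities, and the override rule are all made-up illustrative data, and `categorize` is a hypothetical helper:

```python
# Toy IS_A graph (child -> parents); hypothetical data, not taken from WordNet.
IS_A = {
    "meat": ["food", "chemical substance"],
    "salt": ["seasoning", "chemical substance"],
    "seasoning": ["food"],
    "food": ["entity"],
    "chemical substance": ["entity"],
}

# Manually chosen priorities for the categories we are willing to report.
PRIORITY = {"food": 2, "chemical substance": 1}

# Words on the path that override the default choice (assumed rule).
OVERRIDES = {"seasoning": "chemical substance"}

def ancestors(word):
    """Map every ancestor of `word` to its minimum IS_A distance (BFS)."""
    found = {}
    frontier = [(word, 0)]
    while frontier:
        node, dist = frontier.pop(0)
        for parent in IS_A.get(node, []):
            if parent not in found:
                found[parent] = dist + 1
                frontier.append((parent, dist + 1))
    return found

def categorize(word):
    reachable = ancestors(word)
    # A trigger word in the path forces a specific category.
    for trigger, forced in OVERRIDES.items():
        if trigger in reachable and forced in reachable:
            return forced
    candidates = [c for c in PRIORITY if c in reachable]
    # Highest priority wins; ties go to the closer category.
    return max(candidates, key=lambda c: (PRIORITY[c], -reachable[c]), default=None)

print(categorize("meat"))
print(categorize("salt"))
```

Here "meat" is reported as food because food has the higher priority, while "salt" is reported as a chemical substance because "seasoning" appears in its path.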
WordNet's hypernym tree ends with a single root synset for the word "entity". If you are using WordNet's C library, then you can get a whole recursive structure for a synset's ancestors using traceptrs_ds, and you can get the whole synset tree by recursively following nextss and ptrlst pointers until you hit null pointers.
Sorry, may I ask which tool could judge the "difficulty level" of sentences?
I wish to find sentences of a "similar difficulty level" for users to read.
import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-200')
I first ran this code to download the pre-trained model.
glove.most_similar(positive=['sushi', 'uae'], negative=['japan'])
would then result in:
[('nahyan', 0.5181387066841125),
('caviar', 0.4778318405151367),
('paella', 0.4497394263744354),
('nahayan', 0.44313961267471313),
('zayed', 0.4321245849132538),
('omani', 0.4285220503807068),
('seafood', 0.4279175102710724),
('saif', 0.426000714302063),
('dirham', 0.4214130640029907),
('sashimi', 0.4165934920310974)]
In this example, we can see that the method failed to capture the 'type' or 'category' of the query: 'zayed' and 'nahyan' are not of 'type' food; they are person names.
The approach suggested by my professor is to use wordnet hypernyms to find the 'type'.
After much research, the closest solution I found is to somehow incorporate lowest_common_hypernyms(), which gives the lowest common hypernym between two synsets, and use it to filter the results of most_similar().
I am not sure if my idea makes sense and would like the community's feedback on it.
My idea is to compute the hypernyms of, e.g., 'sushi' and the hypernyms of all the similar words returned by most_similar(), and only choose the word with the 'longest' lowest common hypernym path. I expect this should return the word that best matches the 'type'.
Not sure if it makes sense...
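To make the idea concrete, here is a toy sketch. The hypernym tree is hand-written and hypothetical (real code would take it from WordNet, for example through NLTK's lowest_common_hypernyms()), each word has a single parent, and `filter_by_type` is an illustrative helper:

```python
# Hand-written child -> parent hypernym links (hypothetical, single parent only).
PARENT = {
    "sushi": "dish", "paella": "dish", "sashimi": "dish",
    "dish": "food", "caviar": "food", "seafood": "food",
    "food": "entity",
    "zayed": "person", "nahyan": "person", "person": "entity",
}

def path_to_root(word):
    """Return [word, parent, ..., root]."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lch_depth(a, b):
    """Depth (root = 0) of the lowest common hypernym of a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return len(path_to_root(node)) - 1
    return -1  # no shared ancestor (unknown word)

def filter_by_type(query, candidates, min_depth=1):
    """Keep candidates whose common hypernym with `query` is deep enough."""
    scored = [(lch_depth(query, c), c) for c in candidates]
    kept = [(d, c) for d, c in scored if d >= min_depth]
    # Deepest common hypernym first; ties keep their original order.
    return [c for d, c in sorted(kept, key=lambda dc: -dc[0])]

similar = ["nahyan", "caviar", "paella", "zayed", "seafood", "sashimi"]
print(filter_by_type("sushi", similar))
```

In this toy tree the person names only share the root with 'sushi' and are dropped. Real WordNet synsets can have several hypernyms and each word several senses, which this sketch ignores.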
Does your proposed approach give adequate results when you try it?
That's the only test of whether the idea makes sense.
Word2vec is generally oblivious to all the variations of category that a lexicon like WordNet can provide – all the words that are similar to another word, in any aspect, will be neighbors. Even words that people consider opposites – like 'hot' and 'cold' – will often be fairly close to each other, in some direction in the coordinate space, as they are similar in what they describe and what contexts they're used in. (They can be drop-in replacements for each other.)
Word2vec is also fairly oblivious to polysemy in its standard formulation.
Some other things worth trying might be:
if you need only answers of a certain type, mix-in some measurement ranking candidate answers by their closeness to a word either describing that type ('food') or representing multiple examples (say an average vector for many food-names you'd know to be good answers)
choose another vector-set, or train your own. There's no universal "goodness" for word-vectors: their quality for certain tasks will vary based on their training data & parameters. Vectors trained on something broader than Wikipedia (your named vector file), or some text corpus more focused on your domain-of-interest – say, food criticism – might do better on some tasks. Changing training parameters can also change which kinds of similarity are most emphasized in the resulting vectors. For example, some observers have noticed small context-windows tend to put words that are direct drop-in replacements for each other closer-together, while larger context-windows bring words from the same domains-of-use, even if not drop-in replacements of the same 'type', closer. (It sounds like your current need might be best served with a model trained with smaller windows.)
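The first suggestion above (mixing in closeness to a type) might look something like this sketch. The 3-d vectors and scores are made up for illustration; in practice the vectors would come from the loaded model, and `rerank` and its `weight` parameter are hypothetical:

```python
import math

# Tiny made-up 3-d vectors; real ones would come from the loaded GloVe model.
VECTORS = {
    "sushi":  [0.9, 0.1, 0.0],
    "paella": [0.8, 0.2, 0.1],
    "caviar": [0.7, 0.1, 0.3],
    "zayed":  [0.1, 0.9, 0.2],
    "dirham": [0.0, 0.3, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def centroid(words):
    """Average vector of several known-good examples of the wanted type."""
    vecs = [VECTORS[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def rerank(candidates, type_examples, weight=0.5):
    """Blend each candidate's original score with its closeness to the type."""
    type_vec = centroid(type_examples)
    blended = [
        (word, (1 - weight) * score + weight * cosine(VECTORS[word], type_vec))
        for word, score in candidates
    ]
    return sorted(blended, key=lambda ws: ws[1], reverse=True)

# (word, original most_similar score) pairs; scores are illustrative.
candidates = [("zayed", 0.52), ("caviar", 0.48), ("dirham", 0.42)]
for word, score in rerank(candidates, type_examples=["sushi", "paella"]):
    print(word, round(score, 3))
```

With these toy numbers the food word moves ahead of the person name; how much weight to give the type term is something to tune on real data.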
Nahyan is from the UAE - it seems to be part of the name of all three presidents. So you seem to be getting what you ask for. If you want more foods, add "food" to your positive query, and maybe "people" to your negative query?
Another approach is to post-filter your results to remove anything that isn't a food. Or is a person. (WordNet won't be much help, as it is nowhere near comprehensive on foods, and even less so on people; Wikidata is likely to be more useful.)
By the way, if you find the common hypernym of sushi and UAE, it will probably be the top-level entity in WordNet. So that will give you no filtering.
I have been exploring NLP techniques with the goal of identifying the subject of survey comments (which I then use in conjunction with sentiment analysis). I want to make high level statements such as "10% of survey respondents made a positive comment (+ sentiment) about Account Managers".
My approach has used Named Entity Recognition (NER). Now that I am working with real data, I am getting visibility of some of the complexities & nuances associated with identifying the subject of a sentence. Here are six examples of sentences where the subject is the Account Manager. I have put the named entity in bold for demonstration purposes.
Our account manager is great, he always goes the extra mile!
Steve our account manager is great, he always goes the extra mile!
Steve our relationship manager is great, he always goes the extra mile!
Steven is great, he always goes the extra mile!
Steve Smith is great, he always goes the extra mile!
Our business mgr. is great,he always goes the extra mile!
I see three challenges that add complexity to my task
Synonyms: Account manager vs relationship manager vs business mgr. This is somewhat domain specific and tends to vary with the survey target audience.
Abbreviations: Mgr. vs manager
Ambiguity - Whether “Steven” is “Steve Smith” & therefore an “account manager”.
Of these the synonym problem is the most frequent issue, followed by the ambiguity issues. Based on what I have seen, the abbreviation issue isn’t that frequent in my data.
Are there any NLP techniques that can help deal with any of these issues to a relatively high degree of confidence?
As far as I understood, what you call the "subject" is, given a sentence, the entity that a statement is made about - in your example, Steve the account manager.
Based on this assumption, here are a few techniques and how they might help you:
(Dependency) Parsing
Since you don't mean subject in the strict grammatical sense, the approach suggested by user7344209 based on dependency parsing probably won't help you. In a sentence such as "I like Steve", the grammatical subject is "I", although you probably want to find "Steve" as the "subject".
Named Entity Recognition
You already use this, and it will be great to detect names of persons such as Steve. What I'm not so sure about is the example of the "account manager". Both the output provided by Daniel and my own test with Stanford CoreNLP did not identify it as a named entity - which is correct, it really is not a named entity:
Something broader such as the suggested mention identification might be better, but it basically marks every noun phrase which is probably too broad. If I understood it correctly, you want to find one subject per sentence.
Coreference Resolution
Coreference Resolution is the key technique to detect that "Steve" and the "account manager" are the same entity. Stanford CoreNLP has such a module, for example.
In order for this to work in your example, you have to let it process several sentences at once, since you want to find the links between them. Here is an example with (shortened versions of) some of your examples:
The visualization is a bit messy, but it basically found the following coreference chains:
Steve <-> Steve Smith
Steve our account manager <-> He <-> Our account manager
Our <-> Our
the extra mile <-> the extra mile
Given the first two chains, and a bit of post-processing, you could figure out that all four statements are about the same entity.
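One possible form of that post-processing, as a rough sketch: merge chains that share a content word. The chains are copied from the list above; merging on shared tokens and the `IGNORED` word list are assumed heuristics, not something CoreNLP provides:

```python
# Chains as reported above; merging on shared name tokens is an assumed
# post-processing heuristic, not something CoreNLP does for you.
CHAINS = [
    ["Steve", "Steve Smith"],
    ["Steve our account manager", "He", "Our account manager"],
    ["Our", "Our"],
    ["the extra mile", "the extra mile"],
]
IGNORED = {"our", "he", "the"}  # pronouns/determiners that should not link chains

def content_tokens(chain):
    return {tok.lower() for mention in chain for tok in mention.split()} - IGNORED

def merge_chains(chains):
    merged = []  # list of (tokens, mentions)
    for chain in chains:
        tokens, mentions = content_tokens(chain), list(chain)
        remaining = []
        for other_tokens, other_mentions in merged:
            if tokens & other_tokens:
                # Shared content word: fold the earlier group into this one.
                tokens |= other_tokens
                mentions = other_mentions + mentions
            else:
                remaining.append((other_tokens, other_mentions))
        merged = remaining + [(tokens, mentions)]
    return [mentions for _, mentions in merged]

for group in merge_chains(CHAINS):
    print(group)
```

The first two chains end up in one group because both contain "Steve". A heuristic like this will also merge two different people who share a first name, so it needs checking against real data.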
Semantic Similarity
In the case of account, business and relationship manager, I found that the CoreNLP coreference resolver actually already finds chains despite the different terms.
More generally, if you think that the coreference resolver cannot handle synonyms and paraphrases well enough, you could also try to include measures of semantic similarity. There is a lot of work in NLP on predicting whether two phrases are synonymous or not.
Some approaches are:
Looking up synonyms in a thesaurus such as Wordnet - e.g. with nltk (python) as shown here
Better, compute a similarity measure based on the relationships defined in WordNet - e.g. using SEMILAR (Java)
Using continuous representations for words to compute similarities, for example based on LSA or LDA - also possible with SEMILAR
Using more recent neural-network-style word embeddings such as word2vec or GloVe - the latter are easily usable with spacy (python)
One way to use these similarity measures would be to identify entities in two sentences, then make pairwise comparisons between the entities in both sentences and, if a pair has a similarity higher than a threshold, consider it as being the same entity.
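A minimal sketch of that pairwise comparison. The similarity scores are hypothetical placeholders for whatever measure you choose (WordNet-based, LSA/LDA, or embeddings), and the 0.7 threshold is an arbitrary assumption:

```python
from itertools import product

# Hypothetical similarity scores; in practice these would come from WordNet,
# LSA/LDA, or word embeddings as listed above.
SIMILARITY = {
    frozenset(["account manager", "relationship manager"]): 0.82,
    frozenset(["account manager", "business mgr."]): 0.74,
    frozenset(["account manager", "extra mile"]): 0.10,
}

def similarity(a, b):
    if a == b:
        return 1.0
    return SIMILARITY.get(frozenset([a, b]), 0.0)

def match_entities(entities_a, entities_b, threshold=0.7):
    """Return pairs across two sentences judged to be the same entity."""
    return [
        (a, b)
        for a, b in product(entities_a, entities_b)
        if similarity(a, b) >= threshold
    ]

sentence_1 = ["account manager", "extra mile"]
sentence_2 = ["relationship manager", "extra mile"]
print(match_entities(sentence_1, sentence_2))
```

The threshold trades precision for recall and would need tuning on annotated examples from your own surveys.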
If you don't have much data to train, you probably can try a dependency analysis tool and extract dependency pairs which have SUBJECT identified (usually the nsubj if you use Stanford Parser).
I like your approach using NER. This is what I see in our system for your inputs:
Mention-Detection output might also be useful:
On your 2nd point, which involves abbreviations: it is a hard problem. But we have an entity-similarity module here that might be useful. It takes into account things like honorifics etc.
About your 3rd point, co-reference problem, try the coref module:
Btw the above figures are from the demo here: http://deagol.cs.illinois.edu:8080
I was trying to build an entity resolution system, where my entities are,
(i) General named entities, that is organization, person, location, date, time, money, and percent.
(ii) Some other entities like product, or title of a person like president, CEO, etc.
(iii) Coreferred entities like pronoun, determiner phrase, synonym, string match, demonstrative noun phrase, alias, apposition.
From various literature and other references, I have defined its scope such that I would not consider the ambiguity of an entity beyond its entity category. That is, I am treating the Oxford of Oxford University as different from Oxford the place, as the former is the first word of an organization entity and the latter is a location entity.
My task is to construct one resolution algorithm, where I would extract and resolve the entities.
So, I am working out an entity extractor in the first place.
In the second place, to relate the coreferences, I found from various literature like this seminal work that they work out a decision-tree-based algorithm with features such as distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
The algorithm seems a nice one, where entities are extracted with a Hidden Markov Model (HMM).
I could work out one entity recognition system with HMM.
Now I am trying to work out a coreference as well as an entity resolution system. I was wondering whether, instead of using so many features, I could use an annotated corpus and train an HMM-based tagger on it directly, with a view to solving relationship extraction, like,
*"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA
small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"
where, PERS-> PERSON
PPERS->PERSONAL PRONOUN TO PERSON
PoPERS-> POSSESSIVE PRONOUN TO PERSON
APPERS-> APPOSITIVE TO PERSON
LOC-> LOCATION
NA-> NOT AVAILABLE*
Would I be wrong? I ran an experiment with around 10,000 words. Early results seem encouraging. With support from one of my colleagues, I am trying to insert some semantic information, like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE), into the tagset to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better now.
Please see this new thought too.
I also got some good results with a Naive Bayes classifier, where sentences having predominantly one set of keywords are marked as one class.
If anyone can suggest a different approach, please feel free to do so.
I use Python 2.x on MS Windows and try to use libraries like NLTK, Scikit-learn, Gensim, pandas, NumPy, SciPy, etc.
Thanks in Advance.
It seems that you are going down three different paths that are totally different, and each could be a standalone PhD. There is a lot of literature about them. My first advice: focus on the main task and outsource the rest. If you are going to develop this for a less widely studied language, you can also build on the work of others.
Named Entity Recognition
Stanford NLP has really gone far in this, especially for English. They resolve named entities really well, are widely used, and have a nice community.
Another solution may exist in OpenNLP for Python.
Some have tried to extend it to unusual fine-grained types, but you need much more training data to cover the cases, and the decision becomes much harder.
Edit: Stanford NER is available in NLTK (Python).
Named Entity Resolution/Linking/Disambiguation
This is concerned with linking the name to some knowledge base, and solves the problem of whether "Oxford" means Oxford University or Oxford City.
AIDA: one of the state-of-the-art systems for that. It uses different context information as well as coherence information. They have also tried supporting several languages, and they have a good benchmark.
Babelfy: offers an interesting API that does NER and NED for entities and concepts. It also supports many languages, but it never worked very well.
others like tagme and wikifi ...etc
Coreference Resolution
Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where they combined coreference resolution with NED.
Sorry for the weird question title, but I couldn't think of an appropriate one.
I'm new to NLP concepts, so I used the NER demo (http://cogcomp.cs.illinois.edu/demo/ner/results.php). Now the issue is how, and in what ways, I can use the tags produced by NER. I mean, what answers or inferences can one draw from named entities that have been tagged into certain groups (location, person, organization, etc.)? If I have data which contains names of entirely new companies, places, etc., then how am I going to do NER tagging for such data?
Please don't downvote or block me; I just need guidance/expert suggestions, that's it. Reading about a concept is one thing, while knowing where and when to apply it is another, which is where I'm asking for guidance. Thanks a ton!
A snippet from the demo:-
Dogs have been used in cargo areas for some time, but have just been introduced recently in
passenger areas at LOC Newark and LOC JFK airports. LOC JFK has one dog and LOC Newark has a
handful, PER Farbstein said.
Usually NER is a step in a pipeline. For example, once all entities have been tagged, if you have many sentences like [PER John Smith], CEO of [ORG IBM] said..., then you can set up a table of Companies and CEOs. This is a form of knowledge base population.
There are plenty of other uses, though, depending on the type of data you already have and what you are trying to accomplish.
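As a rough illustration of the companies-and-CEOs table described above, here is a sketch that assumes the tagger's output uses the bracketed `[PER ...]` / `[ORG ...]` style from the example sentence. The pattern and `extract_ceos` are hypothetical and cover only that one phrasing:

```python
import re

# Matches the bracketed tag style used in the example sentence above.
PATTERN = re.compile(r"\[PER (?P<person>[^\]]+)\], CEO of \[ORG (?P<org>[^\]]+)\]")

def extract_ceos(tagged_sentences):
    """Build an {organization: CEO} table from NER-tagged sentences."""
    table = {}
    for sentence in tagged_sentences:
        for match in PATTERN.finditer(sentence):
            table[match.group("org")] = match.group("person")
    return table

sentences = [
    "[PER John Smith], CEO of [ORG IBM] said the quarter was strong.",
    "Dogs were introduced at [LOC Newark] airport.",
]
print(extract_ceos(sentences))
```

A real pipeline would need many such patterns, or a trained relation extractor, to cover the ways a CEO can be mentioned.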
I think there are two parts in your question:
What is the purpose of NER?
This is a vast question. Generally it is used for Information Retrieval (IR) tasks such as indexing, document classification, and Knowledge Base Population (KBP), but also for many others (speech recognition, translation)... it is quite hard to produce an exhaustive list.
How can NER be extended to also recognize new/unknown entities?
I.e., how can we recognize entities that have never been seen by the NER system? At a glance, two solutions are likely to work:
Let's say you have some linked database that is updated on a regular basis: then the system may rely on generic categories. For instance, let's say "Marina Silva" comes up in the news and is now added to the lexicon under the category "POLITICIAN". Since the system knows that every POLITICIAN should be tagged as a person (i.e. it relies on categories rather than on lexical items), it will tag "Marina Silva" as a PERS named entity. You don't have to re-train the whole system, just update its lexicon.
Using morphological and contextual clues, the system may guess at new named entities that have never been seen (and are not in the lexicon). For instance, a rule like "The presidential candidate XXX YYY" (or "Marina YYY") will guess that "XXX YYY" (or just "YYY") is a PERS (or part of a PERS). This involves, most of the time, probabilistic modeling.
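Both ideas can be sketched in a few lines. This is a deterministic toy, not the probabilistic modeling a real system would use; the lexicon contents, the regular expression, and `tag_entities` are illustrative assumptions:

```python
import re

# Lexicon maps known names to fine-grained categories; categories map to tags.
LEXICON = {"Marina Silva": "POLITICIAN"}
CATEGORY_TO_TAG = {"POLITICIAN": "PERS"}

# Contextual rule: two capitalized words after a trigger phrase are a person.
CONTEXT_RULE = re.compile(r"presidential candidate ((?:[A-Z]\w+ ?){2})")

def tag_entities(text):
    """Return (surface form, tag) pairs found by lexicon lookup or the rule."""
    found = []
    for name, category in LEXICON.items():
        if name in text:
            found.append((name, CATEGORY_TO_TAG[category]))
    for match in CONTEXT_RULE.finditer(text):
        name = match.group(1).strip()
        if (name, "PERS") not in found:
            found.append((name, "PERS"))
    return found

print(tag_entities("Marina Silva met the presidential candidate Jane Doe."))
```

Adding a new name to `LEXICON` is enough for it to be tagged from then on, which mirrors the point above about updating the lexicon instead of retraining.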
Hope this helps :)
I'm working on an application that attempts to match an input set of potentially "messy" entity names to "clean" entity names in a reference list. I've been working with edit distance and other common fuzzy matching algorithms, but I'm wondering if there are any better approaches that allow for term weighting, such that common terms are given less weight in the fuzzy match.
Consider this example, using Python's difflib library. I'm working with organization names, which have many standardized components in common and therefore cannot be used to differentiate among entities.
from difflib import SequenceMatcher
e1a = SequenceMatcher(None, "ZOECON RESEARCH INSTITUTE",
"LONDON RESEARCH INSTITUTE")
print e1a.ratio()
0.88
e1b = SequenceMatcher(None, "ZOECON", "LONDON")
print e1b.ratio()
0.333333333333
e2a = SequenceMatcher(None, "WORLDWIDE SEMICONDUCTOR MANUFACTURING CORP",
"TAIWAN SEMICONDUCTOR MANUFACTURING CORP")
print e2a.ratio()
0.83950617284
e2b = SequenceMatcher(None, "WORLDWIDE",
"TAIWAN")
print e2b.ratio()
0.133333333333
Both examples score highly on the full string because RESEARCH, INSTITUTE, SEMICONDUCTOR, MANUFACTURING, and CORP are high frequency, generic terms in many organization names. I'm looking for any ideas of how to integrate term frequencies into fuzzy string matching (not necessarily using difflib), such that the scores aren't as influenced by common terms, and the results might look more like the "e1b" and "e2b" examples.
I realize I could just make a big "frequent term" list and exclude those from the comparison, but I'd like to use frequencies if possible because even common words add some information, and also the cutoff point for any list would of course also be arbitrary.
Here's a weird idea for you:
Compress your input and diff that.
You could use e.g. a Huffman or dictionary coder to compress your input; that automatically takes care of common terms. It may not do so well for typos, though: in your example, London is probably a relatively common word, while the misspelt Lundon is not at all, so the dissimilarity between compressed terms is much higher than between raw terms.
How about splitting each string into a list of words, and running your comparison on each word to get a list which holds the scores of word matches? Then you can average the scores, find the lowest/highest indirect match or partials...
This gives you the ability to add your own weights.
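A sketch of per-word matching with frequency-based weights, using difflib as in the question. The smoothed IDF formula, the tiny reference list, and `weighted_similarity` are assumptions chosen for illustration; the score is also asymmetric (it walks the tokens of the first name only) and assumes both names are non-empty:

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Small illustrative reference list; real weights need the full list.
REFERENCE = [
    "ZOECON RESEARCH INSTITUTE",
    "LONDON RESEARCH INSTITUTE",
    "WORLDWIDE SEMICONDUCTOR MANUFACTURING CORP",
    "TAIWAN SEMICONDUCTOR MANUFACTURING CORP",
]

def idf_weights(names):
    """Rarer tokens get larger weights (smoothed inverse document frequency)."""
    doc_freq = Counter(tok for name in names for tok in set(name.split()))
    return {tok: math.log((1 + len(names)) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def weighted_similarity(a, b, weights, default_weight=None):
    """Weighted average of each token's best fuzzy match in the other name."""
    if default_weight is None:
        default_weight = max(weights.values())  # unseen tokens count as rare
    tokens_a, tokens_b = a.split(), b.split()
    total = weight_sum = 0.0
    for tok in tokens_a:
        best = max(SequenceMatcher(None, tok, other).ratio() for other in tokens_b)
        w = weights.get(tok, default_weight)
        total += w * best
        weight_sum += w
    return total / weight_sum

weights = idf_weights(REFERENCE)
print(weighted_similarity(REFERENCE[0], REFERENCE[1], weights))
print(SequenceMatcher(None, REFERENCE[0], REFERENCE[1]).ratio())
```

Common words still contribute, just less than rare ones, so no hard cutoff list is needed. Each token is compared with its best counterpart regardless of position, so an extra leading word does not shift the alignment.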
You would of course need to handle offsets like:
"the london company for leather"
and
"london company for leather"
In my opinion, a general solution will never match your idea of similarity. As soon as you have some implicit knowledge about your data, you have to put that into code somehow, which immediately disqualifies a fixed existing solution.
Perhaps you should have a look at http://nltk.org/ to get an idea of some NLP techniques. You don't tell us enough about your data, but a POS tagger might help to identify more and less relevant terms. Available databases with names of cities, countries, ... might help to clean up the data before processing it further.
There are many tools available, but to get high quality output, you will need a solution which is customized for your data and use case.
I am just proposing another different approach. Since you mentioned that the entity names are coming from a reference list, I am wondering if you have additional context information, like co-author names, product/paper titles, address w/ city,state,country?
If you do have some useful context as above, you can actually build a graph of entities out of the relations between them. Relations could be, for example:
Author-paper relation
Co-author relation
author-institute relation
institute-city relation
....
Then it's time to use a graph-based entity resolution approach described in detail at:
http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
http://drum.lib.umd.edu/bitstream/1903/4021/1/4758.pdf
The approach has a very good performance on co-author-paper domain.