First of all, my apologies if I am not following some of this site's best practices; as you will see, my home is mostly MSE (Math Stack Exchange).
I am currently working on a project where I build a vacation recommendation system. The initial idea was somewhat akin to Twenty Questions: we ask the user certain questions, such as "Do you like museums?", "Do you like architecture?", "Do you like nightlife?", etc., and then based on these answers decide the best vacation destination for the user. We answer these questions based on keywords scraped from websites, and the decision tree we would implement would allow us to effectively determine the next question to ask a user. However, we are having some difficulties with the implementation. Some examples of our difficulties are as follows:
There are issues with the granularity of questions. For example, to say that a city is good for "nature-lovers" is great, but this does not mean much. Nature could involve, say, hot, sunny and wet vacations for some, whereas for others nature could involve a brisk hike in cool woods. Fortunately, the API we are currently using provides us with a list of attractions in a city down to a fairly granular level (for example, it distinguishes between different watersport activities such as jet skiing or white water rafting). My question is: do we need to create some sort of hierarchy like:
nature -> (ocean, mountain, plains); mountain -> (hiking, skiing, ...)
or would it be best to simply include the bottom-level results (the activities themselves) and just ask questions regarding those? I only ask because I am unfamiliar with exactly how the classification is done and how the final output is produced. Is there a better sort of structure that should be used?
Thank you very much for your help.
I think using a decision tree is a great idea for this problem. It might be an idea to group your granular activities, and for the "nature lovers" category to list a number of different climate types (dry and sunny, coastal, forests, etc.) and have subcategories within them.
For the activities, you could make categories called watersports, sightseeing, etc. It sounds like your dataset is more granular than you want your decision tree to be, but you can keep dividing that granularity into more categories on the tree until you reach a level you're happy with. You could also include images of each place and activity, maybe even without descriptive text.
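One simple way to realize this grouping (all category and activity names below are made up for illustration) is a nested structure with categories as inner nodes and activities as leaves, plus a helper to flatten any subtree:

```python
# Hypothetical category hierarchy: dicts for inner nodes, lists of
# activities at the leaves.  All names here are illustrative.
ACTIVITIES = {
    "nature": {
        "ocean": ["jet skiing", "white water rafting", "snorkeling"],
        "mountain": ["hiking", "skiing"],
        "plains": ["safari"],
    },
    "culture": {
        "museums": ["art museums", "history museums"],
        "architecture": ["cathedrals", "modern architecture"],
    },
}

def leaves(node):
    """Flatten a subtree into its list of leaf activities."""
    if isinstance(node, list):
        return list(node)
    result = []
    for child in node.values():
        result.extend(leaves(child))
    return result

# A coarse question ("Do you like nature?") selects a subtree; follow-up
# questions then narrow things down within it.
print(leaves(ACTIVITIES["nature"]))
```

A coarse answer selects a subtree, and the next question is chosen from that subtree's children, which matches the decision-tree flow described above.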
Bins and sub-bins are a good idea, as is the nature -> ocean_nature structure.
I was thinking more about your problem last night; TripAdvisor would be a good source. What I would do is take the top 10 items on TripAdvisor and categorize them by type.
Or, maybe your tree narrows it down to 10 cities. You would rank those cities according to popularity or distance from the user.
I'm not sure how to decide which city would be best for watersports, etc. You could even have cities pay to appear at the top of the list.
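The ranking step is straightforward once you have a shortlist; a minimal sketch with invented popularity scores:

```python
# Hypothetical shortlist with made-up popularity scores; rank the
# candidate cities most-popular first.
candidates = {"Barcelona": 95, "Lisbon": 88, "Split": 72, "Valletta": 64}
ranked = sorted(candidates, key=candidates.get, reverse=True)
print(ranked)
```

The same `key` function could instead compute distance from the user, or a weighted blend of both.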
How would you set up the following class structure, and why?
Class Person represents a Person
Class Room represents a Room
Where I would like to store the Room location of a Person, with the ability to efficiently ask the following two questions:
1) What Room is Person X in? and,
2) Which Persons are in Room Y?
The options I can see are:
Class Person stores the Room location (question 2 would be inefficient),
Class Room stores its Persons (question 1 would be inefficient),
Both of the above (opportunity for data integrity issues) and,
External dictionary holds the data (seems against the spirit of OOP)
Or maybe you can implement all the approaches at once: you define both associations (a Person has a Room, and a Room has many Persons), and you also index everything using a proper data structure so you can access your objects at light speed!
That is, you walk across your object graph using object-oriented associations (i.e., composition), and you additionally store the graph in specific data structures to fulfill your lookup requirements.
This isn't against the spirit of OOP. It's in favor of good coding practices instead of artificial dogmas.
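As a sketch of this combined approach (the class and method names are my own, not from any framework), a small class can own both directions of the association and keep them in sync, so both questions are O(1) dictionary lookups and the two indexes cannot drift apart:

```python
class Roster:
    """Keeps both directions of the person<->room association in sync,
    so both lookups are fast and data integrity is preserved."""

    def __init__(self):
        self._room_of = {}     # person -> room
        self._people_in = {}   # room -> set of persons

    def move(self, person, room):
        # Remove the person from their old room, if any, then re-index.
        old = self._room_of.get(person)
        if old is not None:
            self._people_in[old].discard(person)
        self._room_of[person] = room
        self._people_in.setdefault(room, set()).add(person)

    def room_of(self, person):   # question 1: what room is person X in?
        return self._room_of.get(person)

    def people_in(self, room):   # question 2: which persons are in room Y?
        return set(self._people_in.get(room, set()))

roster = Roster()
roster.move("alfred", "office")
roster.move("charlene", "office")
roster.move("ben", "storage")
print(roster.room_of("charlene"))   # office
print(roster.people_in("office"))   # alfred and charlene
```

Because all mutation goes through `move`, the integrity concern raised in the question (option 3) is contained in one place.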
Not every problem is a good fit for OOP. This is one of them. :-)
What you basically have here is a two-dimensional grid or matrix.
One axis represents the persons, the other represents the rooms. The restriction is that a person can only be in one room.
A list of (person, room) tuples seems like a good way to represent this in Python. Here's an example (in IPython), with your two queries:
In [1]: locations = [('alfred', 'office'), ('ben', 'storage'), ('charlene', 'office'), ('deborah', 'factory')]
In [2]: [room for person, room in locations if person == 'charlene']
Out[2]: ['office']
In [3]: [person for person, room in locations if room == 'office']
Out[3]: ['alfred', 'charlene']
The queries are list comprehensions, so they are concise; note that each one is a linear scan over the list.
Unless your list of locations is huge, I don't think performance will be a big problem.
Note that this solution will work well with empty rooms and persons that are not present.
In [4]: [room for person, room in locations if person == 'zeke']
Out[4]: []
In [5]: [person for person, room in locations if room == 'maintenance']
Out[5]: []
You will have to take care not to insert the same person twice.
Definitely an external structure. That is the spirit of OOP: classes represent objects. Neither a person is a property of a room, nor a room a property of a person, so there is no reason to put one in the other's class. Instead, the dictionary represents a relationship between them, so it should be a new entity.
A simple format would be a dictionary where each key is a room and the value is a list of the people inside it. However, you could probably create a class around it if you need more complex functionality
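A minimal sketch of that dictionary format, with the two queries from the question (question 1 then needs a scan, which is the trade-off discussed above):

```python
# Single-dictionary form: room -> list of occupants.  Question 2 is a
# direct lookup; question 1 requires scanning the rooms.
occupants = {
    "office": ["alfred", "charlene"],
    "storage": ["ben"],
    "factory": ["deborah"],
}

def room_of(person):
    """Question 1: find which room a person is in (linear scan)."""
    for room, people in occupants.items():
        if person in people:
            return room
    return None

print(occupants["office"])   # question 2: who is in the office?
print(room_of("ben"))        # question 1: where is ben?
```

If question 1 must also be fast, a second dictionary mapping person to room can be maintained alongside this one, at the cost of keeping the two in sync.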
I am trying to build an entity resolution system, where my entities are:
(i) General named entities, that is organization, person, location,date, time, money, and percent.
(ii) Some other entities like, product, title of person like president,ceo, etc.
(iii) Corefererred entities like, pronoun, determiner phrase,synonym, string match, demonstrative noun phrase, alias, apposition.
From various literature and other references, I have defined the scope so that I do not consider the ambiguity of an entity beyond its entity category. That is, I treat the "Oxford" of "Oxford University" as different from "Oxford" the place, as the former is the first word of an organization entity and the latter is a location entity.
My task is to construct a resolution algorithm, where I would extract and resolve the entities.
So, I am working on an entity extractor in the first place. In the second place, when I try to relate the coreferences, the literature (such as this seminal work) works out a decision-tree-based algorithm with features like distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
The algorithm seems a nice one, where entities are extracted with a Hidden Markov Model (HMM). I have already worked out an entity recognition system with an HMM.
Now I am trying to work out a coreference as well as an entity resolution system. I was wondering whether, instead of using so many features, I could use an annotated corpus and train an HMM-based tagger on it directly, with a view to solving relationship extraction, as in:
"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"

where:
PERS -> PERSON
PPERS -> PERSONAL PRONOUN TO PERSON
PoPERS -> POSSESSIVE PRONOUN TO PERSON
APPERS -> APPOSITIVE TO PERSON
LOC -> LOCATION
NA -> NOT AVAILABLE
Would I be wrong? I made an experiment with around 10,000 words, and early results seem encouraging. With support from one of my colleagues, I am trying to insert semantic information like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE) into the tagset, to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better then.
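To make the training idea concrete, here is a toy, from-scratch supervised HMM tagger (bigram transitions, add-one smoothing, Viterbi decoding) trained on two tiny hand-annotated sentences in the tagset above. A real system would use something like NLTK's HiddenMarkovModelTrainer with proper smoothing; the corpus and smoothing here are simplified purely for illustration.

```python
from collections import defaultdict

def train(tagged_sents):
    """Collect transition and emission counts from an annotated corpus."""
    trans, emit, vocab = defaultdict(dict), defaultdict(dict), set()
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag_ in sent:
            w = word.lower()
            trans[prev][tag_] = trans[prev].get(tag_, 0) + 1
            emit[tag_][w] = emit[tag_].get(w, 0) + 1
            vocab.add(w)
            prev = tag_
        trans[prev]["</s>"] = trans[prev].get("</s>", 0) + 1
    return trans, emit, vocab

def _p(counts, key, n):
    # Add-one smoothed probability over a space of n outcomes.
    return (counts.get(key, 0) + 1.0) / (sum(counts.values()) + n)

def tag(words, trans, emit, vocab):
    """Viterbi decoding: best[(i, t)] = (score, backpointer tag)."""
    tags, n_words, n_tags = list(emit), len(vocab), len(emit)
    best = {(0, t): (_p(trans["<s>"], t, n_tags) *
                     _p(emit[t], words[0].lower(), n_words), None)
            for t in tags}
    for i in range(1, len(words)):
        for t in tags:
            score, prev = max(
                ((best[(i - 1, s)][0] * _p(trans[s], t, n_tags), s)
                 for s in tags),
                key=lambda x: x[0])
            best[(i, t)] = (score * _p(emit[t], words[i].lower(), n_words), prev)
    # Trace back from the best final tag.
    last = len(words) - 1
    path = [max(tags, key=lambda t: best[(last, t)][0])]
    for i in range(last, 0, -1):
        path.append(best[(i, path[-1])][1])
    return list(zip(words, reversed(path)))

# Invented two-sentence training corpus in the tagset described above.
corpus = [
    [("Obama", "PERS"), ("is", "NA"), ("in", "NA"), ("Washington", "LOC")],
    [("he", "PPERS"), ("knew", "NA"), ("Washington", "LOC")],
]
trans, emit, vocab = train(corpus)
tagged = tag(["Obama", "is", "in", "Washington"], trans, emit, vocab)
print(tagged)
```

Even on this tiny corpus the decoder recovers the intended tags, which suggests the approach is at least workable; real coverage of course depends on a much larger annotated corpus.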
Please see this new thought too.
I also got some good results with a Naive Bayes classifier, where sentences having predominantly one set of keywords are marked as one class.
If anyone can suggest a different approach, please feel free to do so.
I use Python 2.x on MS Windows and libraries like NLTK, scikit-learn, Gensim, pandas, NumPy, SciPy, etc.
Thanks in advance.
It seems that you are going down three different paths that are totally different, each of which could be a standalone PhD; there is a lot of literature on each. My first advice: focus on the main task and outsource the rest. If you are developing this for a less-common language, you can also build on others' work.
Named Entity Recognition
Stanford NLP has gone really far in this area, especially for English. It resolves named entities really well, is widely used, and has a nice community.
Other solutions exist, such as OpenNLP, which can also be used from Python.
Some have tried to extend it to unusual fine-grained types, but you need much bigger training data to cover the cases, and the decision becomes much harder.
Edit: Stanford NER is available in Python through NLTK.
Named Entity Resolution/Linking/Disambiguation
This is concerned with linking a name to some knowledge base, and solves the problem of whether "Oxford" refers to Oxford University or the city of Oxford.
AIDA is one of the state-of-the-art systems here. It uses different kinds of context information as well as coherence information, supports several languages, and has a good benchmark.
Babelfy offers an interesting API that does NER and NED for entities and concepts. It also supports many languages, though it never worked very well for me.
There are others, like TagMe, Wikify, etc.
Coreference Resolution
Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where coreference resolution is combined with NED.
Sorry for the weird question title, but I couldn't think of a more appropriate one.
I'm new to NLP concepts, so I used the NER demo (http://cogcomp.cs.illinois.edu/demo/ner/results.php). Now the issue is how, and in what ways, I can use these taggings produced by NER. What answers or inferences can one draw from named entities tagged into groups such as location, person, organization, etc.? And if I have data with names of entirely new companies, places, etc., how am I going to produce NER taggings for such data?
Please don't downvote or block me; I just need guidance or expert suggestions. Reading about a concept is one thing, while knowing where and when to apply it is another, which is where I'm asking for guidance. Thanks a ton!
A snippet from the demo:-
Dogs have been used in cargo areas for some time, but have just been introduced recently in
passenger areas at LOC Newark and LOC JFK airports. LOC JFK has one dog and LOC Newark has a
handful, PER Farbstein said.
Usually NER is a step in a pipeline. For example, once all entities have been tagged, if you have many sentences like [PER John Smith], CEO of [ORG IBM] said..., then you can set up a table of Companies and CEOs. This is a form of knowledge base population.
There are plenty of other uses, though, depending on the type of data you already have and what you are trying to accomplish.
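As a toy sketch of that knowledge base population step (the bracketed tag format and the sentences are invented for illustration), a regular expression over tagged text can harvest the CEO/company pairs:

```python
import re

# Invented example: entity tags are rendered inline in brackets.
tagged = ("[PER John Smith], CEO of [ORG IBM], said the deal was done. "
          "[PER Mary Major], CEO of [ORG Acme], agreed.")

# Harvest (company -> CEO) pairs from the "PER, CEO of ORG" pattern.
pattern = re.compile(r"\[PER ([^\]]+)\], CEO of \[ORG ([^\]]+)\]")
ceos = {org: per for per, org in pattern.findall(tagged)}
print(ceos)
```

A real pipeline would of course use many such relation patterns (or a trained relation extractor) rather than one hand-written regex, but the principle of building a table from tagged entities is the same.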
I think there are two parts in your question:
What is the purpose of NER?
This is a vast question. Generally NER is used for Information Retrieval (IR) tasks such as indexing, document classification, and Knowledge Base Population (KBP), but also for many, many others (speech recognition, translation, ...); it is quite hard to give an exhaustive list.
How can NER be extended to also recognize new/unknown entities?
That is, how can we recognize entities that have never been seen by the NER system? At a glance, two solutions are likely to work:
Let's say you have some linked database that is updated on a regular basis: then the system may rely on generic categories. For instance, suppose "Marina Silva" comes up in the news and is added to the lexicon under the category "POLITICIAN". Since the system knows that every POLITICIAN should be tagged as a person (i.e., it relies on categories rather than lexical items), it will tag "Marina Silva" as a PERS named entity. You don't have to retrain the whole system, just update its lexicon.
Using morphological and contextual clues, the system may guess new named entities that have never been seen (and are not in the lexicon). For instance, a rule like "The presidential candidate XXX YYY" (or "Marina YYY") will guess that "XXX YYY" (or just "YYY") is a PERS (or part of a PERS). Most of the time this involves probabilistic modeling.
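A toy version of the second solution (the trigger phrase and rule are purely illustrative; a real system would use probabilistic modeling rather than a single regex):

```python
import re

# Illustrative contextual rule: a capitalized word sequence following
# the trigger phrase is guessed to be a PERS entity.
rule = re.compile(r"[Tt]he presidential candidate ((?:[A-Z][a-z]+ ?)+)")

text = "Yesterday the presidential candidate Marina Silva visited Recife."
guesses = [m.group(1).strip() for m in rule.finditer(text)]
print(guesses)
```

The rule fires even though "Marina Silva" appears in no lexicon, which is exactly the point: the context, not the lexical item, licenses the PERS guess.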
Hope this helps :)
For example...
Chicken is an animal.
Burrito is a food.
WordNet allows you to do "is-a" queries via its hierarchy feature.
However, how do I know when to stop travelling up the tree? I want a LEVEL that is consistent.
For example, if presented with a bunch of words, I want WordNet to categorize all of them, but at a certain level, so it doesn't go too far up. Categorizing "burrito" as a "thing" is too broad, yet "mexican wrapped food" is too specific. I want to go up or down the hierarchy until I reach the right LEVEL.
WordNet is a lexicon rather than an ontology, so 'levels' don't really apply.
There is SUMO, an upper ontology that relates to WordNet, if you want a directed lattice instead of a network.
For some domains, SUMO's mid-level ontology is probably where you want to look, but I'm not sure it has 'mexican wrapped food', as most of its topics are scientific or engineering.
WordNet's hierarchy is
beef burrito < burrito < dish/2 < victuals < food < substance < entity.
Entity is the top-level concept, so if you stop one below substance you'll get "burrito is-a food". You can calculate a level based on that, but it won't necessarily be as consistent as SUMO, nor will it generate a set of useful mid-level concepts to terminate at. There is no "mexican wrapped food" step in WordNet.
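The level calculation can be sketched like this, hardcoding the hypernym chain quoted above (in real code you would pull the chain from WordNet, e.g. via NLTK's `synset.hypernym_paths()`):

```python
# The hypernym chain for "beef burrito", hardcoded from the chain above.
path = ["beef burrito", "burrito", "dish", "victuals", "food",
        "substance", "entity"]

def category(path, cutoffs=("substance", "entity")):
    """Return the most general concept on the path below any cutoff,
    i.e. "stop one below substance"."""
    below = [c for c in path if c not in cutoffs]
    return below[-1] if below else path[0]

print(category(path))
```

Choosing the `cutoffs` set is exactly the hard part the answer describes: WordNet itself gives you no principled place to stop.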
[Please credit Pete Kirkham, he first came with the reference to SUMO which may well answer the question asked by Alex, the OP]
(I'm just providing a complement of information here; I started in a comment field but soon ran out of space and layout capabilities...)
Alex: Most of SUMO is science or engineering? It does not contain every-day words like foods, people, cars, jobs, etc?
Pete K: SUMO is an upper ontology. The mid-level ontologies (where you would find concepts between 'thing' and 'beef burrito') listed on the page don't include food, but reflect the sorts of organisations which fund the project. There is a mid-level ontology for people. There's also one for industries (and hence jobs), including food suppliers, but no mention of burritos if you grep it.
My two cents
100% of WordNet (3.0, i.e. the latest, as well as older versions) is mapped to SUMO, and that may be just what Alex needs. The mid-level ontologies associated with SUMO (or rather with MILO) are effectively domain-specific and do not, at this time, include foodstuffs; but since WordNet does include all (well, many of) these everyday things, you do not need to leverage any formal ontology "under" SUMO. Instead, use SUMO's WordNet mapping, possibly in addition to WordNet itself, which, again, is not an ontology, but whose informal and loose "hierarchy" may also help.
Some difficulty may arise, however, in two areas (and then some ;-) ):
the SUMO ontology's "level" may not be the level you'd have in mind for your particular application. For example, while "Burrito" maps straight to "Food", near the top-level entity in SUMO, "Chicken" maps to, well, "Chicken", which only through a long chain reaches "Animal" (specifically: Chicken -> Poultry -> Bird -> WarmBloodedVertebrate -> Vertebrate -> Animal).
WordNet's coverage and metadata are impressive, but its mid-level concepts can be a bit inconsistent. For example, "our" burrito's hypernym is, appropriately, "Dish", which covers circa 140 food dishes, including generics such as "Soup" or "Casserole" as well as "Chicken Marengo" (but omitting, say, "Chicken Cacciatore").
My point, in bringing up these issues, is not to criticize WordNet or SUMO and its related ontologies, but rather to illustrate simply some of the challenges associated with building ontology, particularly at the mid-level.
Regardless of some possible flaws and gaps in a solution based on SUMO and WordNet, a pragmatic use of these frameworks may well "fit the bill" (85% of the time).
In order to get levels, you need to predefine the content of each level. An ontology often defines these as the immediate IS_A children of a specific concept, but if that is absent, you need to develop such a method yourself.
The next step is to put a priority on each category, in case you want to present only one category for each word. The priority can be assigned in multiple ways, for instance as the count of IS_A relations between the category and the word, or as manually selected priorities for each category. For each word, you can then pick the category with the highest priority. For instance, you may want meat to be "food" rather than "chemical substance".
You may also want to pick some words that change the priority when they are in the path; for instance, you may want some chemicals which are also food to be announced as chemicals, while others should still be food.
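A minimal sketch of the priority idea, with invented priorities and IS_A paths:

```python
# Made-up priorities and IS_A paths purely for illustration.
PRIORITY = {"food": 3, "animal": 2, "chemical substance": 1}

IS_A_PATHS = {
    "meat": ["food", "chemical substance"],
    "benzene": ["chemical substance"],
}

def best_category(word):
    """Pick the highest-priority category on the word's IS_A path."""
    return max(IS_A_PATHS[word], key=lambda c: PRIORITY.get(c, 0))

print(best_category("meat"))
```

The per-word priority overrides described above would amount to consulting a word-specific priority table before falling back to `PRIORITY`.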
WordNet's hypernym tree ends with a single root synset for the word "entity". If you are using WordNet's C library, then you can get the whole recursive structure of a synset's ancestors using traceptrs_ds, and you can get the whole synset tree by recursively following nextss and ptrlst pointers until you hit null pointers.
Sorry, may I ask which tool could judge the "difficulty level" of sentences?
I wish to find sentences of a similar difficulty level for users to read.