QA query system on corpus - python

We have a question-answer corpus like the one shown below:
Q: Why did Lincoln issue the Emancipation Proclamation?
A: The goal was to weaken the rebellion, which was led and controlled by slave owners.
Q: Who is most noted for his contributions to the theory of molarity and molecular weight?
A: Amedeo Avogadro
Q: When did he drop John from his name?
A: upon graduating from college
Q: What do beetles eat?
A: Some are generalists, eating both plants and animals. Other beetles are highly specialised in their diet.
Consider the questions as queries and the answers as documents.
We have to build a system that, for a given query (semantically similar to one of the questions in the question corpus), is able to retrieve the right document (the corresponding answer in the answer corpus).
Can anyone suggest an algorithm or a good way to proceed in building it?

Your question is too broad and the task you are trying to do is challenging. However, I suggest you read about IR-based factoid question answering. This document references many state-of-the-art techniques, and reading it should lead you to several ideas.
Please note that you need to follow a different approach for IR-based factoid QA than for knowledge-based QA. First, identify what type of QA system you want to build.
Lastly, I believe a simple document-matching technique won't be enough for QA. But you can try the simple Lucene approach @Debasis suggested and see whether it does well.

Consider a question and its answer (assuming there is only one) as a single document in Lucene. Lucene supports a field view of documents, so while constructing a document, make the question the searchable field. Once you retrieve the top-ranked questions given a query question, use the get method of the Document class to return the answers.
A code skeleton (fill this up yourself):
//Index
IndexWriterConfig iwcfg = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(...);
....
Document doc = new Document();
// TextField contents are analyzed (searchable); Store.YES keeps the raw text retrievable.
doc.add(new TextField("FIELD_QUESTION", questionBody, Field.Store.YES));
doc.add(new TextField("FIELD_ANSWER", answerBody, Field.Store.YES));
...
...
// Search
IndexReader reader = DirectoryReader.open(...); // IndexReader is abstract; open via DirectoryReader
IndexSearcher searcher = new IndexSearcher(reader);
...
...
QueryParser parser = new QueryParser("FIELD_QUESTION", new StandardAnalyzer());
Query q = parser.parse(queryQuestion);
...
...
TopDocs topDocs = searcher.search(q, 10); // top-10 retrieved
// Accumulate the answers from the retrieved questions which
// are similar to the query (new) question.
StringBuilder buff = new StringBuilder();
for (ScoreDoc sd : topDocs.scoreDocs) {
    Document retrievedDoc = reader.document(sd.doc);
    buff.append(retrievedDoc.get("FIELD_ANSWER")).append("\n");
}
System.out.println("Generated answer: " + buff.toString());
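If you would rather stay in Python, the same index-the-question, store-the-answer pattern can be sketched with Whoosh, a pure-Python search engine (mentioned elsewhere on this page). The schema, index directory, and sample data below are illustrative assumptions, not the answerer's code:

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, STORED
from whoosh.qparser import QueryParser

# One document per Q/A pair: the question is indexed (searchable),
# the answer is only stored for retrieval.
schema = Schema(question=TEXT(stored=True), answer=STORED)
os.makedirs("qa_index", exist_ok=True)
ix = index.create_in("qa_index", schema)

writer = ix.writer()
writer.add_document(question="What do beetles eat?",
                    answer="Some are generalists, eating both plants and animals.")
writer.commit()

# Retrieve the top-ranked questions for a new query and collect their answers.
with ix.searcher() as searcher:
    q = QueryParser("question", ix.schema).parse("beetles eat")
    for hit in searcher.search(q, limit=10):
        print(hit["answer"])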

Related

Finding full taxonomy (hierarchical hypernymy sequence) of a given DBpedia resource using SPARQL

Given a DBpedia resource, I want to find the entire taxonomy up to the root.
For example, if I were to say it in plain English: for Barack Obama I want to know the entire taxonomy, which goes Barack Obama → Politician → Person → Being.
I have written the following recursive function for this:
import requests
import json
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")

def get_taxonomy(results, entity, hypernym_list):
    '''This recursive function keeps fetching the hypernyms of the
    DBpedia resource until the highest concept, or root, is reached.'''
    if entity == 'null':
        return hypernym_list
    else:
        query = '''SELECT ?hypernyms WHERE {
            <''' + entity + '''> <http://purl.org/linguistics/gold/hypernym> ?hypernyms .
        }'''
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        for result in results["results"]["bindings"]:
            hypernym_list.append(result['hypernyms']['value'])
        if len(results["results"]["bindings"]) == 0:
            return get_taxonomy(results, 'null', hypernym_list)
        return get_taxonomy(results,
                            results["results"]["bindings"][0]['hypernyms']['value'],
                            hypernym_list)

def get_taxonomy_of_resource(dbpedia_resource):
    list_for_hypernyms = []
    # Seed a dummy non-empty result set so the first recursive call proceeds.
    results = {}
    results["results"] = {}
    results["results"]["bindings"] = [1, 2, 3]
    taxonomy_list = get_taxonomy(results, dbpedia_resource, list_for_hypernyms)
    return taxonomy_list
The code works for the following input:
get_taxonomy_of_resource('http://dbpedia.org/resource/Barack_Obama')
Output:
['http://dbpedia.org/resource/Politician',
'http://dbpedia.org/resource/Person', 'http://dbpedia.org/resource/Being']
Problem:
But for the following input it only gives the hypernym one level up and then stops:
get_taxonomy_of_resource('http://dbpedia.org/resource/Steve_Jobs')
Output:
['http://dbpedia.org/resource/Entrepreneur']
Research:
On doing some research on their site (dbpedia.org/page/<term>) I realized that the reason it stopped at Entrepreneur is that when I click on this resource on their site, it takes me to the resource 'Entrepreneurship', which states its hypernym as 'Process'. So now my problem has become the question:
How do I know that Entrepreneur is redirecting to Entrepreneurship even though both are valid DBpedia entities? My recursive function fails because of this: in the next iteration it attempts to find the hypernym for Entrepreneur rather than Entrepreneurship.
Any help is duly appreciated
I have faced this same problem before while writing a program to generate taxonomies, and my solution was to fall back to Wiktionary when my main resource failed to provide a hypernym.
The Wiktionary dump can be downloaded and parsed into a Python dictionary.
For example, the wiktionary entry for 'entrepreneur' contains the following:
Noun
entrepreneur (plural entrepreneurs)
A person who organizes and operates a business venture and assumes much of the associated risk.
From this definition, the hypernym ('person') can be extracted.
Naturally, this approach entails writing code to extract the hypernym from a definition, a task which is at times easy and at times hard depending on the wording of the definition; see the sketch below.
This approach provides a fallback routine for cases when the main resource (DBpedia in your case) fails to provide a hypernym.
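For instance, a minimal sketch of that extraction, assuming definitions follow the common 'A <hypernym> who/that ...' wording (the function name and pattern here are illustrative, not a general solution):

import re

def hypernym_from_definition(definition):
    # Assumes the pattern 'A/An <hypernym> who/that/which ...';
    # returns None when the definition is worded differently.
    match = re.match(r'(?:a|an|any)\s+(?:\w+\s+)*?(\w+)\s+(?:who|that|which)\b',
                     definition, re.IGNORECASE)
    return match.group(1).lower() if match else None

print(hypernym_from_definition(
    "A person who organizes and operates a business venture "
    "and assumes much of the associated risk."))  # -> 'person'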
Finally, as stated by AKSW, it is good to have a method to catch incorrect hypernym relations (e.g. Entrepreneur → Process). There is the area of textual entailment in natural language processing, which studies methods for determining whether one statement implies or contradicts another.

Create list of words with similar spelling

Given a word (English or non-English), how can I construct a list of words (English or non-English) with similar spelling?
For example, given the word 'sira', some similar words are:
sirra
seira
siara
saira
shira
I'd prefer this to be on the verbose side, meaning it should generate as many words as possible.
Preferably in Python, but code in any language is helpful.
The Australian Business Register ABN lookup tool (a tool that finds business registration numbers based on search keywords) does a good job of this.
Thanks
What you are looking for is provided by the ispell family of dictionaries. There is a relatively easy interface via the hunspell library.
The actual data (dictionaries) can be downloaded from here (among other places, such as the OpenOffice plugin pages).
There is an interface to get a number of similar words based on edit distance, as suggested in the comment. Going with the example from GitHub:
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spook', 'cookie', 'bookie', 'Spokane', 'spoken']
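If you also want to generate candidates yourself, to make the list as verbose as possible, here is a minimal sketch of the classic one-edit-away expansion in the spirit of Norvig's spell-corrector essay (filter the output against a word list if you only want real words):

import string

def edits1(word):
    # All strings one edit away from `word`:
    # deletes, transposes, replaces, and inserts.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print(len(edits1('sira')))  # a few hundred candidate spellings
print('shira' in edits1('sira'), 'sirra' in edits1('sira'))  # True True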
For searching in databases, use LIKE. The query you'd want is:
SELECT * FROM `testTable` WHERE name LIKE "%s%i%r%a%"
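A quick sqlite3 illustration of that pattern (the table and data are made up); the interleaved % wildcards match any name containing s, i, r, a in that order:

import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE testTable (name TEXT)')
con.executemany('INSERT INTO testTable VALUES (?)',
                [('sira',), ('sirra',), ('shira',), ('moira',)])
rows = con.execute(
    "SELECT name FROM testTable WHERE name LIKE '%s%i%r%a%'").fetchall()
print(rows)  # [('sira',), ('sirra',), ('shira',)] -- 'moira' has no s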

Compare two files and make a list

I have two files that I want to compare with each other to form a list. Each file has its own class, Book and Person, with different attributes. The ones I want to compare are: person.personalcode == book.borrowed. From this I want a list of all the borrowed books. I have started like this:
for person in person_list:
    for book in booklibrary_list:
        if person.personalcode == book.borrowed:
            person.books.append(book, person)

for person in person_list:
    if len(person.books) > 0:
        print(person.personalcode + "," + person.firstname + person.lastname + " have borrowed the following books: ")
        for book in person.books:
            print(book)

for person in person_list:
    person.books = []
But it does not work; what have I missed or done wrong?
Posting as an answer as this is too long for a comment.
First: improve your question. Show how you construct the Person and the Book class, and how you populate them. Describe what the personalcode is and how come personalcode would be the same as a book code. Some sample data and a bit more code would make this easier to answer.
Second: reading your other question, you seem to be storing your data in a text file, then loading, querying, modifying and saving it directly. This will lead you to problems; instead you should consider going down one of two lines:
Use an SQL database; possibly the easiest to start with is SQLite, as it does not need a server to be set up and there is a module in the standard library that is very easy to use. Store your data there and you will find it easier in the long run.
Use Python objects (e.g. three classes: Person, Book, and BorrowedBook), manage lists of them within the program, and use shelve from the standard library to store and retrieve these lists of objects between queries.
The use of shelve would be easier if you have not used SQL before, and I hope you will forgive the pun when I say that it might be very appropriate for a book-related application!
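A minimal sketch of the shelve option, assuming Person and Book classes shaped like the asker's (the class layouts, filename and sample data here are illustrative):

import shelve

class Person:
    def __init__(self, personalcode, firstname, lastname):
        self.personalcode = personalcode
        self.firstname = firstname
        self.lastname = lastname
        self.books = []

class Book:
    def __init__(self, title, borrowed=None):
        self.title = title
        self.borrowed = borrowed  # personalcode of the borrower, if any

person_list = [Person("8705123456", "Anna", "Berg")]
booklibrary_list = [Book("Pippi Longstocking", borrowed="8705123456")]

# Persist the object lists between program runs ('library.db' is a made-up name).
with shelve.open('library.db') as db:
    db['persons'] = person_list
    db['books'] = booklibrary_list

# Later, load them back and query in plain Python.
with shelve.open('library.db') as db:
    person_list = db['persons']
    booklibrary_list = db['books']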

How to implement a Python spell checker using Google's "did you mean?"

I'm looking for a way to make a function in python where you pass in a string and it returns whether it's spelled correctly. I don't want to check against a dictionary. Instead, I want it to check Google's spelling suggestions. That way, celebrity names and other various proper nouns will count as being spelled correctly.
Here's where I'm at so far. It works most of the time, but it messes up with some celebrity names. For example, things like "cee lo green" or "posner" get marked as incorrect.
import httplib
import xml.dom.minidom

data = """
<spellrequest textalreadyclipped="0" ignoredups="0" ignoredigits="1" ignoreallcaps="1">
<text> %s </text>
</spellrequest>
"""

def spellCheck(word_to_spell):
    con = httplib.HTTPSConnection("www.google.com")
    con.request("POST", "/tbproxy/spell?lang=en", data % word_to_spell)
    response = con.getresponse()
    dom = xml.dom.minidom.parseString(response.read())
    dom_data = dom.getElementsByTagName('spellresult')[0]
    if dom_data.childNodes:
        for child_node in dom_data.childNodes:
            result = child_node.firstChild.data.split()
            for word in result:
                if word_to_spell.upper() == word.upper():
                    return True
        return False
    else:
        return True
Peter Norvig explains how to implement a spell checker in Python.
Rather than sticking with Google, try the other big players.
If you really want to stick with search engines that count page requests, Yahoo and Bing provide some excellent features. Yahoo directly provides spell-checking services using YQL tables (free: 5,000 requests/day, non-commercial).
There are also a good number of Python libraries capable of much the same magic, including on the proper nouns you mentioned (they may sometimes get these wrong; after all, it is ultimately based on probability).
So, for the second option, you have a good, totally free list:
GNU Aspell (even has Python bindings)
PyEnchant
Whoosh (it does a lot more than spell checking, but I think it has some edge here)
I hope these give you a clear idea of how things work.
Spell checking actually involves very complex mechanisms in areas such as machine learning, AI and NLP, which is why companies like Google and Yahoo don't really offer their APIs entirely free.
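For instance, a minimal PyEnchant sketch (assuming the en_US dictionary is installed; note that a plain dictionary will still reject proper nouns such as 'posner'):

import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
print(d.check("spookie"))    # False: not a dictionary word
print(d.suggest("spookie"))  # candidate corrections, e.g. ['spook', 'cookie', ...]
print(d.check("posner"))     # False as well; plain dictionaries miss proper nouns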

How to match search strings to content in Python

Usually when we search, we have a list of stories, we provide a search string, and we expect back a list of results where the given search string matches the story.
What I am looking to do is the opposite: given a list of search strings and one story, find out which of the search strings match that story.
Now, this could be done with re, but I want to use complex search queries as supported by Solr. Full details of the query syntax are here. Note: I won't use boosting.
Basically, I want some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
    """
    returns result of searching contents for searchstring (True or False)
    """
    ???????
    ???????

story = "big chunk of story 200 to 1000 words long"

searchstrings = ['sajal', 'sajal AND "is a jerk"', 'sajal kayan',
                 'sajal AND (kayan OR bangkok OR Thailand OR (webmaster AND python))',
                 'bangkok']

matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr)]
Edit: Additionally, I would be interested to know if any module exists to convert a Lucene query like the one below into a regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, I realized that what I am looking to do is a Boolean search.
Found code that makes regexes Boolean-aware: http://code.activestate.com/recipes/252526/
The issue looks solved for now.
A probably slow, but easy, solution:
Make a query to the search engine for the story plus each string. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and similar, this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy; see the sketch below.
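For example, a minimal recursive-descent sketch of that idea. It handles explicit AND/OR, parentheses and quoted phrases, evaluates strictly left to right (no operator precedence), and ignores implicit conjunction, so treat it as a starting point only; all names here are illustrative:

import re

def tokenize(query):
    # Parens, quoted phrases, and bare terms each become one token.
    return re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', query)

def doesitmatch(contents, searchstring):
    # True if the Boolean query matches the contents
    # (case-insensitive substring test per term or quoted phrase).
    tokens = tokenize(searchstring)
    text = contents.lower()
    pos = 0

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ('AND', 'OR'):
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if op == 'AND' else (result or rhs)
        return result

    def parse_term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == '(':
            val = parse_expr()
            pos += 1  # consume the closing ')'
            return val
        return tok.strip('"').lower() in text

    return parse_expr()

story = "sajal is a webmaster living in bangkok"
print(doesitmatch(story, 'sajal AND (kayan OR bangkok OR (webmaster AND python))'))  # True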
Some time ago I looked for a Python implementation of Lucene and I came across Whoosh, which is a pure-Python text search engine. Maybe it will satisfy your needs.
You can also try PyLucene, but I didn't investigate that one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
    result = set()
    for s in search_strings:
        result_story_ids = query_index(s)  # query_index returns an id iterable
        if story_id_to_match in result_story_ids:
            result.add(s)
    return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like prospective search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though; you'd probably be better off writing something like this in Java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html
