How to use Freebase to label a very large unlabeled NLP dataset? - python

Vocabulary that I am using:
nounphrase -- A short phrase that refers to a specific person, place, or idea. Examples of different nounphrases include "Barack Obama", "Obama", "Water Bottle", "Yellowstone National Park", "Google Chrome web browser", etc.
category -- The semantic concept defining which nounphrases belong to it and which ones do not. Examples of categories include, "Politician", "Household items", "Food", "People", "Sports teams", etc. So, we would have that "Barack Obama" belongs to "Politician" and "People" but does not belong to "Food" or "Sports teams".
I have a very large unlabeled NLP dataset consisting of millions of nounphrases. I would like to use Freebase to label these nounphrases. I have a mapping of Freebase types to my own categories. What I need to do is download every example of every Freebase type that I have.
The problem I face is figuring out how to structure this type of query. At a high level, the query should ask Freebase "what are all of the examples of topic XX?" and Freebase should respond with "here's a list of all examples of topic XX." I would be very grateful if someone could give me the syntax of this query. If it can be done in Python, that would be awesome :)

The basic form of the query (for a person, for example) is
[{
  "type": "/people/person",
  "name": None,
  "/common/topic/alias": [],
  "limit": 100
}]
There's documentation available at http://wiki.freebase.com/wiki/MQL_Manual
Using freebase.mqlreaditer() from the Python library http://code.google.com/p/freebase-python/ is the easiest way to cycle through all of these. In this case, the "limit" clause determines the chunk size used for querying, but you'll get each result individually at the API level.
BTW, how do you plan to disambiguate Jack Kennedy the president from the hurler, the football player, the book, etc.? See http://www.freebase.com/search?limit=30&start=0&query=jack+kennedy. You may want to consider capturing additional information from Freebase (birth & death dates, book authors, other types assigned, etc.) if you'll have enough context to be able to use it to disambiguate.
Past a certain point, it may be easier and/or more efficient to work from the bulk data dumps rather than the API http://wiki.freebase.com/wiki/Data_dumps
Edit - here's a working Python program which assumes you've got a list of type IDs in a file called 'types.txt':
import freebase

f = file('types.txt')
for t in f:
    t = t.strip()
    q = [{'type': t,
          'mid': None,
          'name': None,
          '/common/topic/alias': [],
          'limit': 500,
          }]
    for r in freebase.mqlreaditer(q):
        print '\t'.join([t, r['mid'], r['name']] + r['/common/topic/alias'])
f.close()
If you make the query much more complex, you'll probably want to lower the limit to keep from running into timeouts, but for a simple query like this, boosting the limit above the default of 100 will make it more efficient by querying in bigger chunks.

The general problem described here is called Entity Linking in natural language processing.
Unabashed self plug:
See our book chapter on the topic for an introduction and an approach to perform large scale entity linking.
http://cs.jhu.edu/~delip/entity_linking.pdf
#deliprao

Related

"Get" document from cosmosdb by id (not knowing the _rid)

MS Support recently told me that using a "GET" is much more efficient in RU usage than a SQL query. I'm wondering if I can (within the azure.cosmos Python package or a custom HTTP request to the REST API) get a document by its unique 'id' field (for which I generated GUIDs) without a SQL query.
Every example shown uses the link/path of the doc, which is built with the '_rid' metadata of the document and not the 'id' field set when creating the doc.
I use a bulk upsert stored procedure I wrote to create my new documents and never retrieve the metadata for each one of them (I have ~100 million docs), so retrieving the _rid would be equivalent to retrieving the doc itself.
The reason the ReadDocument method is so much more efficient than a SQL query is that it uses _rid instead of a user-generated field, even the required id field. This is because the _rid isn't just a unique value; it also encodes information about where that document is physically stored.
To give an example of how this works, let's say you are explaining to someone where a party is this weekend. You could use the name that you use for the house "my friend Ryan's house" or you could use the address "123 ThatOne Street Somewhere, WA 11111". They both are unique identifiers, but for someone trying to get there one is way more efficient than the other.
Telling someone to go to your friend's house is like using your own id. It does map to a specific house, but the person will still need to find out where that physically is to get there. Using the address is like working with the _rid field. Based on that information alone they can get to the party location. Of course, in the real world the person would probably need directions, but the data storage in a database is a lot more organized than most city streets so an address is sufficient to go retrieve the document.
If you want to take advantage of this method you will need to find a way to work with the _rid field.
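For reference, this is roughly what a point read looks like with the v4 azure.cosmos Python SDK; it takes the generated 'id' plus the partition key value, and whether that is enough for your scenario (versus the _rid-based link) depends on your SDK and API version. The endpoint, key, and names below are assumptions.
# Rough sketch of a point read with the v4 azure-cosmos SDK; endpoint, key,
# database/container names, and partition key value are assumptions.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycollection")

# Point read: the generated 'id' plus the document's partition key value.
doc = container.read_item(item="my-guid", partition_key="my-partition-value")
print(doc["id"])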

NLP general English to action

I am working on automating the task flow of an application using text-based Natural Language Processing.
It is something like a chat application where the user can type in a text area. At the same time, Python code interprets what the user wants and performs the corresponding action.
Application has commands/actions like:
Create Task
Give Name to as t1
Add time to task
Connect t1 to t2
The users can type in chat (natural language). It will be like a general English conversation, for example:
Can you create a task with name t1 and assign time to it. Also, connect t1 to t2
I could write a rule-driven parser, but it would be limited to a few rules only.
Which approach or algorithm can I use to solve this task?
How can I map general English to command or action?
I think the best solution would be to use an external service like API.ai or wit.ai. You can create a free account and then you can map certain texts to so-called 'intents'.
These intents define the main actions of your system. You can also define 'entities' that would capture, for instance, the name of the task. Please have a look at these tools. I'm sure they can handle your use case.
I think your issue is related to a Rule-based system (Wiki).
You need two basic components at the core of a project like this:
1- Rule base:
the list of your rules.
2- Inference engine:
infers information or takes action based on the interaction of the input and the rule base.
spaCy is a Python library that I think will help you (more information).
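To make the rule base / inference engine split concrete, here is a minimal sketch using spaCy's Matcher. The patterns and action names are assumptions for illustration, not a complete grammar.
# Minimal rule-based sketch with spaCy's Matcher; the patterns and action
# names below are assumptions for illustration only.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Rule base: each rule maps a token pattern to an action name.
matcher.add("CREATE_TASK", [[{"LEMMA": "create"}, {"LOWER": "a", "OP": "?"}, {"LOWER": "task"}]])
matcher.add("CONNECT", [[{"LEMMA": "connect"}, {}, {"LOWER": "to"}, {}]])

def infer_actions(text):
    # Inference engine: run every rule over the input and emit matched actions.
    doc = nlp(text)
    return [(nlp.vocab.strings[match_id], doc[start:end].text)
            for match_id, start, end in matcher(doc)]

print(infer_actions("Can you create a task with name t1 and assign time to it. Also, connect t1 to t2"))
# roughly: [('CREATE_TASK', 'create a task'), ('CONNECT', 'connect t1 to t2')]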
You may want to try nltk. This is an excellent library for NLP and comes with a handy book to get you started. I think you may find chapter 8 helpful for finding sentence structure, and chapter 7 useful for figuring out what your user is requesting the bot to do. I would recommend you read the entire thing if you have more than a passing interest in NLP, as most of it is quite general and can be applied outside of NLTK.
What you are describing is a general problem with quite a few possible solutions. Your business requirements, which we do not know, are going to heavily influence the correct approach.
For example, you will need to tokenize the natural language input. Should you use a rules-based approach, or a machine learning one? Maybe both? Let's consider your input string:
Can you create a task with name t1 and assign time to it. Also, connect t1 to t2
Our system might tokenize this input in the following manner:
Can you [create a task] with [name] [t1] and [assign] [time] to it. Also, [connect] [t1] to [t2]
The brackets indicate semantic information, entirely without structure. Does the structure matter? Do you need to know that connect t1 is related to t2 in the text itself, or can we assume that it is because all inputs are going to follow this structure?
If the input will always follow this structure, and will always contain these kinds of semantics, you might be able to get away with parsing this using regular expressions and feeding prebuilt methods.
If the input is instead going to be true natural language (i.e., you are building a Siri or Alexa competitor) then this is going to be wildly more complex, and you aren't going to get a useful answer in an SO post like this. You would instead have a few thousand SO posts ahead of you, assuming you have sufficient familiarity with both linguistics and computer science to allow you to approach the problem systematically.
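For the simpler, structured case above, this is a sketch of what a regex-based command parser could look like (the patterns and handler names are assumptions).
# Rough regex-based sketch for strictly structured input; the patterns and
# handler names are assumptions, not a general solution.
import re

PATTERNS = [
    (re.compile(r"create a task(?: with name (?P<name>\w+))?", re.I), "create_task"),
    (re.compile(r"assign time to (?P<target>\w+)", re.I), "assign_time"),
    (re.compile(r"connect (?P<src>\w+) to (?P<dst>\w+)", re.I), "connect"),
]

def parse_commands(text):
    # Return (handler, captured-arguments) pairs for every pattern that fires.
    commands = []
    for pattern, handler in PATTERNS:
        for match in pattern.finditer(text):
            commands.append((handler, match.groupdict()))
    return commands

print(parse_commands("Can you create a task with name t1 and assign time to it. Also, connect t1 to t2"))
# [('create_task', {'name': 't1'}), ('assign_time', {'target': 'it'}), ('connect', {'src': 't1', 'dst': 't2'})]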
Let's say the text is "Please order a pizza for me" or "May I have a cab booking from uber".
Use a good library like nltk and parse these sentences. As social English is generally grammatically incorrect, you might have to train your parser with your own custom broken-English corpora. Then these are the steps you have to follow to get an idea of what the user wants.
Find the full stops in a paragraph, keeping in mind abbreviations and informal punctuation like "....", "???", etc.
Next, find all the verbs and noun phrases in the individual sentences; this can be done with POS (part-of-speech) tagging from various libraries.
After that the real work starts. My approach would be to create a graph of verbs where similar verbs are close to each other and dissimilar verbs are far apart.
Let's say you have words like arrange, instruction, command, directive, and dictate, which are close to order. So if your user writes any one of these verbs in their text, your algorithm will identify that the user really means order. You can also use the edges of that graph to specify the context in which the verb was used.
Now, you have to assign an action to this verb "order" based on the noun phrases that were parsed from the original sentence.
This is just a high-level explanation of the algorithm; it has many problems which need serious consideration, some of which are listed below.
Finding a similarity index between the root verb and a given verb in very short time.
New words that don't have an entry in the graph. A possible approach is to update your graph by searching Google for the word, finding a context from the pages on which it was mentioned, and finding an appropriate place for the new word in the graph.
Similarity indexes of misspelled words with proper verbs or nouns.
If you want to build a more sophisticated model, you can construct a graph for every part of speech and select appropriate words from each graph to form sentences in response to the queries. The graph described above is meant for the verb part of speech.
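As a rough sketch of the POS-tagging step plus a crude verb-to-action lookup with NLTK (the verb sets and action names are assumptions; a real similarity graph would replace the hard-coded sets):
# Crude sketch: POS-tag the sentence, then map any verb that is "close" to a
# known root verb onto that action. The verb sets and actions are assumptions.
import nltk
# One-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

ACTION_VERBS = {
    "order": {"order", "arrange", "command", "dictate", "book"},
}

def extract_action(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    verbs = [w.lower() for w, tag in tagged if tag.startswith("VB")]
    nouns = [w.lower() for w, tag in tagged if tag.startswith("NN")]
    for action, similar in ACTION_VERBS.items():
        if any(v in similar for v in verbs):
            return action, nouns
    return None, nouns

print(extract_action("Please order a pizza for me"))
# e.g. ('order', ['pizza']) -- the exact output depends on how the tagger labels 'order'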
#whrrgarbl is right, though: it seems like you do not want to train a bot.
So, to handle language input variations (lexical, semantic, ...) you would need a pre-trained bot which you can customize (or maybe just add rules to according to your needs).
The easiest business-oriented solution is Amazon Lex. There is a free preview program too.
Another option would be to use Google's Parsey McParseface (a pre-trained English parser; there is support for 40 languages) and integrate it with a chat framework. Here is a link to a Python repo, where the author claims to have made the installation and training process convenient.
Lastly, this provides a comparison of various chatbot platforms.

Searching for abbreviated words in MySQL

I have a MySQL database, working with Python, with some entries for products such as Samsung Television, Samsung Galaxy Phone etc. Now, if a user searches for Samsung T.V or just T.V, is there any way to return the entry for Samsung Television?
Do full text search libraries like Solr or Haystack support such features? If not, then how do I actually proceed with this?
Thanks.
Yes, Solr will surely allow you to do this and much more. You can start here,
and SolrCloud is a really good way to provide High Availability to end users.
You should have a look at the SynonymFilterFactory for your analyzer. When reading the documentation you will find this section that rather sounds like the scenario you describe.
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Televesion and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
Keep in mind that you should have separate analyzers for index and query time, as described in this SO question: How to make solr synonyms work.

How should I structure a GAE datastore to be able to grab professions related to a keyword?

If someone searches for "teeth doctor", I would like to return entries from a google app engine datastore for dentists. Similarly, "foot doctor" would return podiatrists, "childrens' doctor" pediatrician, etc.
How should I find related keywords, and should I store them with the doctor entries, in a separate table, or grab them on request?
I'm thinking of having one entity for the professionals - it would include their name, location, contact info, etc., but most importantly, the formal name for their profession. And another table for a relation of words to professions. For example, "teeth" would map to dentist, but also orthodontist. Would this be the best way to go about it?
Also, is there a way to have Google sort the results by multiple things? I would like to list the most relevant results, but also give priority to slightly less-related but closer doctors. For example, if a user searches for "teeth", I would want the results to be in the order of: 1. A dentist 0.5 miles away, 2. An orthodontist 0.2 miles away, and 3. A dentist 5 miles away. What I'm currently thinking is keeping track of the estimated likelihood that a searched keyword is meant to return a certain profession and then factoring that into the distance calculation I would use for sorting.
I would probably go with having a profession kind and a professional kind. Professional entities then reference the applicable profession. Profession entities would contain your keywords. You could then use the new app engine search feature to index and search professions (Search Overview (Python)) and use the results to look up professionals. Indexing your professionals this way as well would give you some/all of the location based searching you want to implement.
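A rough sketch of that layout using ndb plus the Search API on the old Python runtime; the kind, field, and index names are assumptions:
# Sketch only: profession keywords go into a Search API index, and professionals
# reference the profession entity. All names here are assumptions.
from google.appengine.api import search
from google.appengine.ext import ndb

class Profession(ndb.Model):
    name = ndb.StringProperty()                    # e.g. "Dentist"
    keywords = ndb.StringProperty(repeated=True)   # e.g. ["teeth", "tooth", "braces"]

class Professional(ndb.Model):
    name = ndb.StringProperty()
    profession = ndb.KeyProperty(kind=Profession)
    location = ndb.GeoPtProperty()

def index_profession(prof):
    # Put the profession's keywords into a searchable Search API document.
    doc = search.Document(
        doc_id=prof.key.urlsafe(),
        fields=[search.TextField(name='name', value=prof.name),
                search.TextField(name='keywords', value=' '.join(prof.keywords))])
    search.Index(name='professions').put(doc)

def find_professionals(query_string):
    # Search professions by keyword, then look up professionals referencing them.
    results = search.Index(name='professions').search(query_string)
    keys = [ndb.Key(urlsafe=doc.doc_id) for doc in results]
    return Professional.query(Professional.profession.IN(keys)).fetch() if keys else []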

Google Apps Engine Datastore Search

I was wondering if there was any way to search the datastore for an entry. I have a bunch of entries for songs (title, artist, rating) but I'm not sure how to really search through them for both song title and artist. We take in a search term and are looking for all entries that "match." But we are lost :( any help is much appreciated!
We are using Python.
Edit 1: my current code is useless (it's an exact search), but it might help you see the issue:
query = song.gql("SELECT * FROM song WHERE title = searchTerm OR artist = searchTerm")
The song data you work with sounds like a rather static data set (primarily inserts, no or few updates). In that case there is a GAE technique called Relation Index Entity (RIE) which is an efficient way to implement keyword-based search.
But some preparation work is required. Briefly:
Build a special RIE entity where you place all searchable keywords
from each song (one-to-one relationship).
The RIE stores them in a StringListProperty, which supports searches like this:
keywords = 'SearchTerm'
(returns True if any of the values in the keywords list matches 'SearchTerm')
An AND condition works immediately by adding multiple filters as above.
An OR condition needs more work: implement an in-memory merge of AND-only queries.
You can find details on the solution workflow and code samples in my blog post Relation Index Entities with Python for Google Datastore.
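For reference, a minimal sketch of the RIE idea with the old db API (the kind and property names are assumptions; see the blog post for the full workflow):
# Minimal Relation Index Entity sketch; kind and property names are assumptions.
from google.appengine.ext import db

class Song(db.Model):
    title = db.StringProperty()
    artist = db.StringProperty()
    rating = db.IntegerProperty()

class SongIndex(db.Model):
    # One index entity per Song, stored as its child; holds all searchable keywords.
    keywords = db.StringListProperty()

def index_song(song):
    keywords = [w.lower() for w in (song.title + ' ' + song.artist).split()]
    SongIndex(parent=song, keywords=keywords).put()

def search_songs(term):
    # '=' on a list property matches if ANY element of the list equals the term.
    q = SongIndex.all(keys_only=True).filter('keywords =', term.lower())
    return db.get([k.parent() for k in q.fetch(100)])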
http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine
