How to calculate a confidence score from a dependency parse tree? - python

Is there any way to get a confidence score, or any score, from the dependency parse tree of a sentence using nltk or something else?
Any advice and suggestions will be greatly appreciated!

It's a hard task and I am not aware of any tool doing it, but if you post something on the corpora mailing list or the language technology section of reddit you will probably get better replies. If it were a research question, I would suggest training a PCFG on the Penn Treebank dataset and then using it to compute the probabilities of the parse trees assigned to sentences. You can grab Mark Johnson's implementation. Search for this line:
cky.tbz contains a very fast C implementation of a CKY PCFG parser,
together with programs for extracting PCFGs from treebanks, etc. This
was used in my 1999 CL article. (last updated 6th March, 2006)
CYK (Viterbi) is a dynamic programming algorithm. PCFG stands for probabilistic CFG, which you typically train on the Penn Treebank dataset. The sum over the probabilities of all possible parse trees for a sentence can be interpreted as how grammatically correct the sentence is. Sorry if this wasn't the actual answer, but it is a workable one, and I can tell you more details if you decide to do it :).
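As a rough illustration of the idea, here is a minimal sketch using NLTK's ViterbiParser with a toy PCFG (the tiny grammar and sentence are invented for illustration; for real scores you would induce the grammar from the Penn Treebank):

import nltk

# toy grammar; in practice, induce one from a treebank
toy_pcfg = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.6] | 'dogs' [0.4]
    VP -> V NP [0.7] | V [0.3]
    Det -> 'the' [1.0]
    N -> 'cat' [0.5] | 'dog' [0.5]
    V -> 'chased' [1.0]
""")

parser = nltk.ViterbiParser(toy_pcfg)
for tree in parser.parse(['the', 'dog', 'chased', 'dogs']):
    # tree.prob() is the probability of the best (Viterbi) parse,
    # which can serve as a rough "confidence" score for the sentence
    print(tree.prob(), tree)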

Related

topic classification using k-gram index

I have a set of topics, each described with a list of keywords, e.g. {Sports: ['Ronaldo Messi Zidane', 'Football Baseball', 'Barcelona Real'], ...}
The task is to classify a particular document. The classification can also be multi-label: a document can belong to topic1, topic2, etc. I don't have enough data, so I can't approach the problem using machine learning. Because I want to retrieve highly precise documents, I approached the problem with a k-gram index.
I treat a given set of topic keywords as queries and build a k-gram index around them, so the keys are character bigrams and the values are the terms containing that bigram. These terms are the terms present in the document I want to classify. After traversing the postings list for every keyword of a topic, I get a set of candidate terms and their corresponding Jaccard similarity scores.
Within a topic, how do I combine the Jaccard scores of all candidate terms?
Across all topics, how do I decide which topic this document belongs to?
Do you think this approach can give me results with high precision?
Thank you.
This seems like a multi-class, multi-label classification problem. Since the questioner is comfortable with a detailed lexical approach, this article will help in building a pragmatic solution.
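To make the aggregation step concrete, here is a hedged sketch of one way the per-term Jaccard scores could be combined per topic and thresholded for multi-label assignment (the averaging strategy and the 0.4 threshold are assumptions, not something the article or question prescribes):

def score_topics(candidate_scores, threshold=0.4):
    """candidate_scores: {topic: [jaccard_score, ...]} from the k-gram lookups."""
    topic_scores = {}
    for topic, scores in candidate_scores.items():
        if scores:
            # averaging keeps topics with many weak matches from dominating;
            # using max() instead would reward a single strong keyword hit
            topic_scores[topic] = sum(scores) / len(scores)
        else:
            topic_scores[topic] = 0.0
    # multi-label: return every topic whose aggregate score clears the threshold
    return [t for t, s in topic_scores.items() if s >= threshold]

print(score_topics({'Sports': [0.8, 0.5], 'Politics': [0.1]}))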

Extracting and ranking keywords from short text

I am working on a project to extract keywords from short texts (3-4 sentences). Using the spaCy library I extract noun phrases and named entities and use them as keywords. However, I would like to sort them based on their importance with respect to the original text.
I tried standard information retrieval approaches, like TF-IDF, and even a couple of graph-based algorithms, but with such short texts the results weren't great.
I was thinking that maybe a NN with an attention mechanism could help me rank those keywords. Is there any way to use the pre-trained models that come with spaCy to do some kind of ranking?
How about something like maximal marginal relevance? http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf
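A rough sketch of MMR over spaCy vectors, assuming a model with word vectors such as en_core_web_md and a lambda of 0.7 (both choices are assumptions for illustration, not part of the MMR paper):

import spacy

nlp = spacy.load("en_core_web_md")  # needs a model that ships word vectors

def mmr_rank(text, candidates, lam=0.7, top_k=5):
    doc = nlp(text)
    cand_docs = [nlp(c) for c in candidates]
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < top_k:
        best_i, best_score = None, float("-inf")
        for i in remaining:
            # relevance to the source text, minus redundancy with already-picked keywords
            relevance = cand_docs[i].similarity(doc)
            redundancy = max((cand_docs[i].similarity(cand_docs[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
    return [candidates[i] for i in selected]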

Python - How to intuit word from abbreviated text using NLP?

I was recently working on a data set that used abbreviations for various words. For example,
wtrbtl = water bottle
bwlingbl = bowling ball
bsktball = basketball
There did not seem to be any consistency in the convention used, i.e. sometimes they used vowels, sometimes not. I am trying to build a mapping object like the one above for abbreviations and their corresponding words without a complete corpus or comprehensive list of terms (i.e. abbreviations could be introduced that are not explicitly known). For simplicity's sake, say it is restricted to things you would find in a gym, but it could be anything.
Basically, if you only look at the left-hand side of the examples, what kind of model could do the same processing as our brain in relating each abbreviation to the corresponding full-text label?
My ideas have stopped at taking the first and last letters and finding those in a dictionary, then assigning prior probabilities based on context. But since there are a large number of morphemes without a marker that indicates the end of a word, I don't see how it's possible to split them.
UPDATED:
I also had the idea of combining a couple of string metric algorithms, like a Match Rating Algorithm, to determine a set of related terms and then calculating the Levenshtein distance between each word in the set and the target abbreviation. However, I am still in the dark when it comes to abbreviations for words not in a master dictionary. Basically, inferring word construction - maybe a Naive Bayes model could help, but I am concerned that any error in precision caused by the algorithms above will invalidate any model training process.
Any help is appreciated, as I am really stuck on this one.
If you cannot find an exhaustive dictionary, you could build (or download) a probabilistic language model, to generate and evaluate sentence candidates for you. It could be a character n-gram model or a neural network.
For your abbreviations, you can build a "noise model" which predicts the probability of character omissions. It can learn from a corpus (which you would have to label manually or semi-manually) that consonants are dropped less frequently than vowels.
Having a complex language model and a simple noise model, you can combine them using the noisy channel approach (see e.g. the article by Jurafsky for more details) to suggest candidate sentences.
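As a toy illustration of the noisy channel combination in log space (both model functions below are crude placeholders standing in for a real language model and a learned noise model):

import math

def language_model_logp(phrase):
    # placeholder: in practice, a character n-gram or neural language model
    return -0.5 * len(phrase)

def noise_model_logp(abbreviation, phrase):
    # placeholder: in practice, learned per-character keep/drop probabilities
    dropped = len(phrase.replace(" ", "")) - len(abbreviation)
    return -0.3 * abs(dropped)

def noisy_channel_score(abbreviation, phrase):
    # log P(phrase) + log P(abbreviation | phrase); higher is better
    return language_model_logp(phrase) + noise_model_logp(abbreviation, phrase)

candidates = ["water bottle", "water battle", "waiter bottle"]
ranked = sorted(candidates, key=lambda p: noisy_channel_score("wtrbtl", p), reverse=True)
print(ranked)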
Update. I got enthusiastic about this problem and implemented this algorithm:
language model (character 5-gram trained on the Lord of the Rings text)
noise model (probability of each symbol being abbreviated)
beam search algorithm, for candidate phrase suggestion.
My solution is implemented in this Python notebook. With trained models, it has an interface like noisy_channel('bsktball', language_model, error_model), which, by the way, returns
{'basket ball': 33.5, 'basket bally': 36.0}. Dictionary values are scores of the suggestions (the lower, the better).
With other examples it works worse: for 'wtrbtl' it returns
{'water but all': 23.7,
'water but ill': 24.5,
'water but lay': 24.8,
'water but let': 26.0,
'water but lie': 25.9,
'water but look': 26.6}
For 'bwlingbl' it gives
{'bwling belia': 32.3,
'bwling bell': 33.6,
'bwling below': 32.1,
'bwling belt': 32.5,
'bwling black': 31.4,
'bwling bling': 32.9,
'bwling blow': 32.7,
'bwling blue': 30.7}
However, when trained on an appropriate corpus (e.g. sports magazines and blogs, maybe with oversampling of nouns), and perhaps with a more generous beam search width, this model will provide more relevant suggestions.
So I've looked at a similar problem and came across a fantastic package called PyEnchant. If you use the built-in spell-checker you can get word suggestions, which would be a nice and simple solution. However, it will only suggest single words (as far as I can tell), so the situation you have:
wtrbtl = water bottle
Will not work.
Here is some code:
import enchant
wordDict = enchant.Dict("en_US")
inputWords = ['wtrbtl','bwlingbl','bsktball']
for word in inputWords:
    print(wordDict.suggest(word))
The output is:
['rebuttal', 'tribute']
['bowling', 'blinding', 'blinking', 'bumbling', 'alienable', 'Nibelung']
['basketball', 'fastball', 'spitball', 'softball', 'executable', 'basketry']
Perhaps if you know what sort of abbreviations there are you can separate the string into two words, e.g.
'wtrbtl' -> ['wtr', 'btl']
There's also the Natural Language Toolkit (NLTK), which is AMAZING, and you could use it in combination with the above code, for example by looking at how common each suggested word is.
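For instance, here is a hedged sketch of re-ranking PyEnchant's suggestions by how frequent each word is in NLTK's Brown corpus (the choice of corpus is an assumption for illustration):

import enchant
import nltk
from nltk.corpus import brown

nltk.download('brown')  # one-time download
freq = nltk.FreqDist(w.lower() for w in brown.words())

wordDict = enchant.Dict("en_US")
for word in ['wtrbtl', 'bwlingbl', 'bsktball']:
    suggestions = wordDict.suggest(word)
    # prefer suggestions that are common words in the corpus
    ranked = sorted(suggestions, key=lambda s: freq[s.lower()], reverse=True)
    print(word, ranked[:3])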
Good luck!
One option is to go back in time and compute the Soundex Algorithm equivalent.
Soundex drops all the vowels, handles common mispronunciations and crunched-up spellings. The algorithm is simplistic and used to be done by hand. The downside is that it has no special word stemming or stop word support.
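A simplified Soundex sketch (it skips the H/W separator rule of the full algorithm), which you could use to bucket abbreviations and dictionary words by code:

def soundex(word, length=4):
    # map consonants to Soundex digits; vowels, H, W and Y get no code
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.upper()
    first, result = word[0], []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:  # keep first letter, skip repeats of the same code
            result.append(code)
        prev = code
    return (first + "".join(result))[:length].ljust(length, "0")

print(soundex("water"), soundex("wtr"))    # W360 W360
print(soundex("bottle"), soundex("btl"))   # B340 B340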
... abbreviations for words not in a master dictionary.
So, you're looking for an NLP model that can come up with valid English words without having seen them before?
It is probably easier to find a more exhaustive word dictionary, or perhaps to map each word in the existing dictionary to common extensions such as +"es" or word[:-1] + "ies".
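A quick sketch of that last suggestion, expanding an existing word list with a few common inflections (the suffix rules here are illustrative, not exhaustive):

def expand_dictionary(words):
    expanded = set(words)
    for w in words:
        expanded.add(w + "s")
        expanded.add(w + "es")
        if w.endswith("y"):
            expanded.add(w[:-1] + "ies")
    return expanded

print(sorted(expand_dictionary({"bottle", "berry"})))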

How to automatically label a cluster of words using semantics?

The context: I already have clusters of words (phrases, actually) resulting from k-means applied to internet search queries, using common URLs among the search engine results as a distance (co-occurrence of URLs rather than words, if I simplify a lot).
I would like to automatically label the clusters using semantics, in other words I'd like to extract the main concept surrounding a group of phrases considered together.
For example - sorry for the subject of my example - if I have the following bunch of queries: ['my husband attacked me', 'he was arrested by the police', 'the trial is still going on', 'my husband can go to jail for harassing me?', 'free lawyer']
My study deals with domestic violence, but clearly this cluster is focused on the legal aspect of the problem so the label could be "legal" for example.
I am new to NLP, but I should point out that I don't want to extract words using POS tagging (or at least that is not the expected final outcome, though it may be a necessary preliminary step).
I read about WordNet for sense disambiguation and I think that might be a good track, but I don't want to calculate similarity between two queries (since the clusters are the input), nor obtain the definition of one selected word from the context provided by the whole bunch of words (which word would I select in that case?). I want to use the whole bunch of words to provide a context (maybe using synsets or categorization with the XML structure of WordNet) and then summarize that context in one or a few words.
Any ideas? I can use R or Python; I read a little about nltk but I can't find a way to use it in my context.
Your best bet is probably to label the clusters manually, especially if there are few of them. This is a difficult problem even for humans, because you might need a domain expert. Anyone claiming they can do it automatically and reliably (except in some very limited domains) is probably running a startup and trying to get your business.
Also, going through the clusters yourself will have benefits: 1) you may discover you had the wrong number of clusters (the k parameter) or that there was too much junk in the input to begin with; 2) you will gain qualitative insight into what is being talked about and what topics there are in the data (which you probably can't know before looking at the data). Therefore, label manually if qualitative insight is what you are after. If you need a quantitative result too, you could then train a classifier on the manually labelled topics to 1) predict topics for the rest of the clusters, or 2) for future use, if you repeat the clustering, get new data, etc.
When we talk about semantics in this area we mean statistical semantics. Statistical or distributional semantics is very different from other definitions of semantics that have logic and reasoning behind them. Statistical semantics is based on the distributional hypothesis, which treats context as an aspect of the meaning of words and phrases. Meaning in this very abstract and general sense is usually called a topic in the literature. There are several unsupervised methods for modelling topics, such as LDA, or even word2vec, which basically provides a word similarity metric or suggests a list of similar words for a document as another kind of context. Usually, once you have these unsupervised clusters, you need a domain expert to tell you the meaning of each cluster.
However, for several reasons you might accept a low-accuracy assignment of a word as the general topic (or, in your words, the "global semantic") of a list of phrases. If this is the case, I would suggest taking a look at word sense disambiguation tasks that look for coarse-grained word senses. For WordNet, this is called the supersense tagging task.
This paper is worth a look: More or less supervised supersense tagging of Twitter
And regarding your question about choosing words from the current phrases, there is also an active question about converting phrases to vectors; my word2vec-style answer to that question might be useful:
How can a sentence or a document be converted to a vector?
I can add more related papers later if they come to mind.
The paper Automatic Labelling of Topic Models explains the authors' approach to this problem. To give an overview: they generate label candidates using information retrieved from Wikipedia and Google, and once they have the list of candidates in place they rank those candidates to find the best label.
I think the code is not available online, but I have not looked for it.
The package chowmein claims to do this in python using the algorithm outlined in Automatic Labeling of Multinomial Topic Models.
One possible approach, which the papers below suggest, is identifying the set of keywords from the cluster, getting all their synonyms, and then finding the hypernyms of each synonym.
The idea is to get a more abstract meaning for the cluster by using the hypernym.
Example: A word cluster containing words dog and wolf should not be labelled with either word but as canids. They achieve it using synonymy and hypernymy.
Cluster Labeling by Word Embeddings and WordNet's Hypernymy
Automated Text Clustering and Labeling using Hypernyms
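A small sketch of the hypernym idea above using NLTK's WordNet interface; the dog/wolf pair comes from the example, and lowest_common_hypernyms does the generalisation:

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')  # one-time download

dog = wn.synset('dog.n.01')
wolf = wn.synset('wolf.n.01')
# the lowest common hypernym is a candidate label for the cluster
print(dog.lowest_common_hypernyms(wolf))  # [Synset('canine.n.02')]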

Question on sentiment analysis

I have a question regarding sentiment analysis that I need help with.
Right now, I have a bunch of tweets I've gathered through the Twitter search API. Because I chose the search terms, I know which subjects or entities (person names) I want to look at. I want to know how others feel about these people.
For starters, I downloaded a list of English words with known valence/sentiment scores and calculated sentiment (+/-) based on the presence of these words in the tweet. The problem is that with sentiment calculated this way, I'm really looking at the tone of the tweet rather than the sentiment ABOUT the person.
For instance, I have this tweet:
"lol... Person A is a joke. lmao!"
The message is obviously positive in tone, but Person A should get a negative score.
To improve my sentiment analysis, I can probably take negation and modifiers from my word list into account. But how exactly can I get my sentiment analysis to look at the subject of the message (and possibly sarcasm) instead?
It would be great if someone can direct me towards some resources....
While you wait for answers from researchers in the AI field, I will give you some clues about what you can do quickly.
Even though this topic requires knowledge of natural language processing, machine learning and even psychology, you don't have to start from scratch unless you're desperate or have no trust in the quality of research going on in the field.
One possible approach to sentiment analysis is to treat it as a supervised learning problem, where you have a small training corpus that includes human-made annotations (more about that later) and a testing corpus on which you test how well your approach/system performs. For training you will need a classifier, like an SVM, an HMM or something else, but keep it simple. I would start with binary classification: good vs. bad. You could do the same for a continuous spectrum of opinion, from positive to negative, that is, to get a ranking, like Google, where the most valuable results come out on top.
For a start, check the libsvm classifier; it is capable of doing both classification ({good, bad}) and regression (ranking).
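Here is a hedged sketch of that kind of binary classifier using scikit-learn (its SVC wraps libsvm; LinearSVC uses the related liblinear) rather than the raw libsvm tools; the four-line toy corpus is invented and stands in for real annotated data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["great service and friendly staff",
               "terrible food, will not come back",
               "loved the atmosphere",
               "awful experience, rude waiter"]
train_labels = ["good", "bad", "good", "bad"]

# bag-of-words TF-IDF features feeding a linear SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

print(clf.predict(["the staff was rude and the food was cold"]))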
The quality of the annotations will have a massive influence on the results you get, but where can you get them from?
I found one project about sentiment analysis that deals with restaurants. There is both data and code, so you can see how they extracted features from natural language and which features scored high in the classification or regression.
The corpus consists of opinions of customers about restaurants they recently visited and gave some feedback about the food, service or atmosphere.
The connection between their opinions and the numerical world is expressed in terms of the number of stars they gave to the restaurant. You have natural language on one side and the restaurant's rating on the other.
Looking at this example you can devise your own approach for the problem stated.
Take a look at nltk as well. With nltk you can do part-of-speech tagging and, with some luck, extract names as well. Having done that, you can add a feature to your classifier that assigns a score to a name if, within n words (a skip n-gram), there are words expressing opinions (look at the restaurant corpus), or use the weights you already have, but it's best to rely on a classifier to learn the weights; that's its job.
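A small sketch of that last idea: POS-tag the tweet, treat proper nouns as candidate names, and sum opinion-word scores within an n-word window (the tiny opinion lexicon and the window size are assumptions standing in for a real valence list):

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

opinion_words = {"joke": -1, "great": 1, "awful": -1, "brilliant": 1}

def name_sentiment(tweet, window=4):
    tokens = nltk.word_tokenize(tweet)
    tagged = nltk.pos_tag(tokens)
    scores = {}
    for i, (tok, tag) in enumerate(tagged):
        if tag in ("NNP", "NNPS"):  # proper nouns as a rough stand-in for person names
            context = tokens[max(0, i - window): i + window + 1]
            scores[tok] = sum(opinion_words.get(w.lower(), 0) for w in context)
    return scores

print(name_sentiment("lol... Person A is a joke. lmao!"))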
In the current state of technology this is impossible.
English (and any other language) is VERY complicated and cannot be "parsed" yet by programs. Why? Because EVERYTHING has to be special-cased. Saying that someone is a joke is a special case of "joke", which is another exception in your program. Etcetera, etc, etc.
A good example (posted by ScienceFriction somewhere here on SO):
Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the brake system of a Toyota.
If you are willing to spend +/-40 years of your life on this subject, go ahead, it will be much appreciated :)
I don't entirely agree with what nightcracker said. I agree that it is a hard problem, but we are making good progress towards a solution.
For example, part-of-speech tagging might help you figure out the subject, verb and object in a sentence, and n-grams might help you figure out the context in the Toyota vs. thriller example. Look at TagHelperTools. It is built on top of Weka and provides part-of-speech and n-gram tagging.
Still, it is difficult to get the results the OP wants, but it won't take 40 years.
