TextBlob Naive Bayes. Choosing highest likelihood - python

As training data, I have restaurant reviews in XML, each with an associated target expression (what the sentiment is expressed toward), a category (a discrete label the target belongs to), and the polarity expressed toward it:
<text>With the great variety on the menu , I eat here often and never get bored .</text>
<Opinions>
<Opinion target="menu" category="FOOD#STYLE_OPTIONS" polarity="positive" from="30" to="34"/>
</Opinions>
I have used the TextBlob NB classifier to train target terms against their associated categories.
For test data, my aim is to predict the target expression, given a sentence and the category. I have first extracted nouns and noun phrases from the sentence, assuming the expression will be a subset of these. For the sentence:
"what may be interesting to most is the worst sevice attitude come from the owner of this establishment", these are ['sevice attitude', 'owner', 'establishment'].
I would like to know which of these is most likely given the category, which in this case is SERVICE#GENERAL. How could I go about this?

TextBlob's NB classifier extracts the text features as a bag of words by default. So you can simply join the extracted nouns into a string, concatenate it with the category, and use the result as the training text, with the target as the training label.
Since a bag of words treats words independently, you should transform each noun phrase into a single token. You can put a '-' instead of the space, for example ('sevice attitude' would become 'sevice-attitude').
Example:
from textblob.classifiers import NaiveBayesClassifier
train = [('sevice-attitude owner establishment SERVICE#GENERAL', 'owner'),
         ('menu variety FOOD#STYLE_OPTIONS', 'menu')]
cl = NaiveBayesClassifier(train)
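To then pick the most likely target for a test sentence, a minimal sketch reusing the cl trained above (prob_classify is part of TextBlob's classifier API; note that only targets seen during training can be scored):
# Build the test text the same way: collapsed noun phrases plus the category.
test_text = 'sevice-attitude owner establishment SERVICE#GENERAL'
prob_dist = cl.prob_classify(test_text)
print(prob_dist.max())  # most likely target
for label in prob_dist.samples():
    print(label, round(prob_dist.prob(label), 3))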
If you want you can customize the feature extraction: https://textblob.readthedocs.io/en/dev/classifiers.html#feature-extractors

Related

Setting n-grams for sentiment analysis with Python and TextBlob

I want to do sentiment analysis of some sentences with Python and TextBlob lib.
I know how to use it, but is there any way to set n-grams for it?
Basically, I do not want to analyze word by word, but I want to analyze 2 words, 3 words, because phrases can carry much more meaning and sentiment.
For example, this is what I have done (it works):
from textblob import TextBlob
my_string = "This product is very good, you should try it"
my_string = TextBlob(my_string)
sentiment = my_string.sentiment.polarity
subjectivity = my_string.sentiment.subjectivity
print(sentiment)
print(subjectivity)
But how can I apply, for example n-grams = 2, n-grams = 3 etc?
Is it possible to do that with TextBlob, or VaderSentiment lib?
Here is a solution that finds n-grams without using any libraries.
from textblob import TextBlob
def find_ngrams(n, input_sequence):
    # Split sentence into tokens.
    tokens = input_sequence.split()
    ngrams = []
    for i in range(len(tokens) - n + 1):
        # Take n consecutive tokens in array.
        ngram = tokens[i:i+n]
        # Concatenate array items into string.
        ngram = ' '.join(ngram)
        ngrams.append(ngram)
    return ngrams

if __name__ == '__main__':
    my_string = "This product is very good, you should try it"
    ngrams = find_ngrams(3, my_string)
    analysis = {}
    for ngram in ngrams:
        blob = TextBlob(ngram)
        print('Ngram: {}'.format(ngram))
        print('Polarity: {}'.format(blob.sentiment.polarity))
        print('Subjectivity: {}'.format(blob.sentiment.subjectivity))
To change the n-gram length, change the value of n passed to find_ngrams().
There is no parameter within textblob to define n-grams as opposed to words/unigrams to be used as features for sentiment analysis.
Textblob uses a polarity lexicon to calculate the overall sentiment of a text. This lexicon contains unigrams, which means it can only give you the sentiment of a single word, not of an n-gram with n > 1.
I guess you could work around that by feeding bi- or tri-grams into the sentiment classifier, just like you would feed in a sentence and then create a dictionary of your n-grams with their accumulated sentiment value.
But I'm not sure that this is a good idea. I'm assuming you are looking at bigrams to address problems like negation ("not bad"), and the lexicon approach won't be able to use "not" for flipping the sentiment value of "bad".
Textblob also contains an option to use a Naive Bayes classifier instead of the lexicon approach. This is trained on a movie review corpus provided by NLTK, but the default features for training are words/unigrams, as far as I can make out from peeking at the source code.
You might be able to implement your own feature extractor within there to extract n-grams instead of words and then re-train it accordingly and use for your data.
Regardless of all that, I would suggest that you use a combination of unigrams and n>1-grams as features, because dropping unigrams entirely is likely to affect your performance negatively. Bigrams are much more sparsely distributed, so you'll struggle with data sparsity problems when training.
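As a rough sketch of the custom-feature-extractor idea (this uses TextBlob's feature_extractor hook; bigram_extractor and the toy training data are made up for illustration), keeping unigrams alongside bigrams as suggested:
from textblob.classifiers import NaiveBayesClassifier

def bigram_extractor(document):
    # Use bigrams and unigrams as boolean features.
    tokens = document.split()
    feats = {}
    for w1, w2 in zip(tokens, tokens[1:]):
        feats['contains({} {})'.format(w1, w2)] = True
    for w in tokens:
        feats['contains({})'.format(w)] = True
    return feats

train = [('not bad at all', 'pos'), ('really bad service', 'neg')]  # toy examples
cl = NaiveBayesClassifier(train, feature_extractor=bigram_extractor)
print(cl.classify('not bad'))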

Why can't spacy differentiate between two homograph tokens in the following code?

A homograph is a word that has the same spelling as another word but has a different sound and a different meaning, for example, lead (to go in front of) / lead (a metal).
I was trying to use spacy word vectors to compare documents with each other by summing each word vector for each document and then finally finding cosine similarity. If for example spacy vectors have the same vector for the two 'lead' listed above , the results will be probably bad.
In the code below, why does the similarity between the two 'bank' tokens come out as 1.0?
import spacy
nlp = spacy.load('en')
str1 = 'The guy went inside the bank to take out some money'
str2 = 'The house by the river bank.'
str1_tokenized = nlp(str1.decode('utf8'))
str2_tokenized = nlp(str2.decode('utf8'))
token1 = str1_tokenized[-6]
token2 = str2_tokenized[-2]
print 'token1 = {} token2 = {}'.format(token1,token2)
print token1.similarity(token2)
The output for the given program is
token1 = bank token2 = bank
1.0
As kntgu already pointed out, spaCy distinguishes tokens by their characters, not by their semantic meaning. The sense2vec approach by the developers of spaCy concatenates tokens with their POS-tag and can help in the case of 'lead_VERB' vs. 'lead_NOUN'. However, it will not help in your example of 'bank (river bank)' vs. 'bank (financial institute)', as both are nouns.
SpaCy does not support any solution to this out of the box, but you can have a look at contextualized word representations like ELMo or BERT. Both generate word vectors for a given sentence, taking the context into account. Therefore, I assume the vectors for both 'bank' tokens will be substantially different.
Both are relatively recent approaches and are not as comfortable to use, but might help in your use case. For ELMo, there is a command line tool which lets you generate word embeddings for a set of sentences without having to write any code: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#writing-contextual-representations-to-disk
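For illustration, a minimal sketch with BERT via the Hugging Face transformers package (the bert-base-uncased checkpoint and the helper function are my own choices, not something spaCy provides); the similarity between the two 'bank' vectors should come out clearly below 1.0:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def contextual_vector(sentence, word):
    # Return the hidden state of the (first) subtoken matching `word`.
    enc = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(enc['input_ids'][0].tolist())
    return hidden[tokens.index(word)]

v1 = contextual_vector('The guy went inside the bank to take out some money', 'bank')
v2 = contextual_vector('The house by the river bank.', 'bank')
print(torch.cosine_similarity(v1, v2, dim=0).item())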

Using gensim's Phraser with pre-trained vectors

I am trying to use pre-trained word embeddings taking into account phrases. Popular pre-trained embeddings like GoogleNews-vectors-negative300.bin.gz have separate embeddings for phrases as well as unigrams e.g., embeddings for New_York and the two unigrams New and York. Naive word tokenization and dictionary look-up ignore the bigram embedding.
Gensim provides a nice Phrases model which, given a text sequence, can learn compact phrases, e.g. New_York instead of the two unigrams New and York. This is done by aggregating and comparing count statistics between the unigrams and the bigram.
1. Is it possible to use Phrases with pre-trained embeddings without estimating the count statistics elsewhere?
2. If not, is there an efficient way to use these bigrams? I can imagine a way using a loop, but I believe it is ugly (below).
Here is the ugly code.
from nltk import word_tokenize

# `model` is assumed to be the loaded pre-trained KeyedVectors.
last_added = False
sentence = 'I love New York.'
tokens = ["<s>"] + word_tokenize(sentence) + ["</s>"]
vectors = []
for index, token in enumerate(tokens[1:], start=1):
    if last_added:
        last_added = False
        continue
    if "%s_%s" % (tokens[index-1], token) in model:
        vectors.append("%s_%s" % (tokens[index-1], token))
        last_added = True
    else:
        vectors.append(tokens[index-1])
        last_added = False
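A slightly cleaner version of the same greedy merge, as a sketch (merge_bigrams is my own helper; model is again assumed to be the loaded pre-trained KeyedVectors):
from nltk import word_tokenize

def merge_bigrams(sentence, model):
    # Replace adjacent token pairs with their bigram entry (e.g. 'New_York')
    # whenever the pre-trained model contains it; otherwise keep the unigram.
    tokens = word_tokenize(sentence)
    merged = []
    i = 0
    while i < len(tokens):
        bigram = '%s_%s' % (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if bigram is not None and bigram in model:
            merged.append(bigram)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# e.g. merge_bigrams('I love New York.', model) might give ['I', 'love', 'New_York', '.']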

Text similarity with gensim and cosine similarity

from gensim import corpora, models, similarities
documents = ["This is a book about cars, dinosaurs, and fences"]
# remove common words and tokenize
stoplist = set('for a of the and to in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Remove commas
texts[0] = [text.replace(',','') for text in texts[0]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "I like cars and birds"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi] # perform a similarity query against the corpus
print(sims)
In the above code I am comparing how much "This is a book about cars, dinosaurs, and fences" is similar to "I like cars and birds" using the cosine similarity technique.
The two sentences have effectively one word in common, which is "cars", however when I run the code I get that they are 100% similar. This does not make sense to me.
Can someone suggest how to improve my code so that I get a reasonable number?
These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized examples of just one or a few text examples don't work well – and even if they do, it's often just good luck or contrived suitability.
In particular:
a model with only one example can't sensibly create multiple topics, as there's no contrast-between-documents to model
a model presented with words it hasn't seen before ignores those words, so your test doc appears to it the same as the single word 'cars' – the only word it's seen before
In this case, both your single training document, and the test document, get modeled by LSI as having 0 contribution from the 0th topic, and positive contribution (of different magnitudes) from the 1st topic. Since cosine-similarity merely compares angle, and not magnitude, both docs are along-the-same-line-from-the-origin, and so have no angle-of-difference, and thus similarity 1.0.
But if you had better training data, and more than a single-known-word test doc, you might start to get more sensible results. Even a few dozen training docs, and a test doc with several known words, might help... but hundreds or thousands or tens-of-thousands training-docs would be even better.
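To illustrate, here is the same pipeline with a handful of made-up training documents; with more contrast between documents and a multi-word query, the similarity scores should spread out instead of all being 1.0:
from gensim import corpora, models, similarities

documents = [
    "this book is about cars and dinosaurs",
    "dinosaurs lived long before cars existed",
    "birds can fly and some birds cannot",
    "the fence around the garden keeps the birds out",
    "old cars are repaired behind the fence",
]
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

query = "I like cars and birds"
vec_lsi = lsi[dictionary.doc2bow(query.lower().split())]
index = similarities.MatrixSimilarity(lsi[corpus])
print(index[vec_lsi])  # a spread of scores rather than a flat 1.0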

How to extract rows with only meaningful text in a column

I have a large excel file like the following:
Timestamp    | Text | Work | Id
5/4/16 17:52 | rain a lot the packs maybe damage. | Delivery | XYZ
5/4/16 18:29 | wh. screen | Other | ABC
5/4/16 14:54 | 15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support | Delivery | YYY
5/6/16 13:05 | How will I know if I | Signing up | ASX
5/4/16 23:07 | an quality | Delivery | DFC
I want to work only on the "Text" column and then eliminate the rows that basically just have gibberish in the "Text" column (rows 2, 4, 5 in the above example).
I'm reading only the 2nd column as follow:
import xlrd

book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in range(1, sheet.nrows):  # skip heading row
    timestamp, text = sheet.row_values(row_index, end_colx=2)
    print(text)
How do I remove the gibberish rows? I have an idea that I need to work with nltk and have a positive corpus (one that does not have any gibberish), one negative corpus (only having gibberish text), and train my model with it. But how do I go about implementing it? Please help!!
You can use nltk to do the following.
import nltk
english_words = set(w.lower() for w in nltk.corpus.words.words())
'a' in english_words
True
'dog' in english_words
True
'asdasdase' in english_words
False
How to get individual words in nltk from string:
individual_words_front_string = nltk.word_tokenize('This is my text from text column')
individual_words_front_string
['This', 'is', 'my', 'text', 'from', 'text', 'column']
For each row's text column, test the individual words to see if they are in the English dictionary. If they all are, you know that row's text column is not gibberish.
If your definition of gibberish vs non-gibberish differs from the English words found in nltk, you can use the same process above, just with a different list of acceptable words.
How to accept numbers and street addresses?
Simple way to determine if something is a number.
word = '32423432'
word.isdigit()
True
word = '32423432ds'
word.isdigit()
False
Addresses are more difficult. You can find info on that here: Parsing Addresses, and probably many other places. Of course you can always use the above logic if you have access to a list of cities, states, roads... etc.
Will it fail if any one word is not in the list?
It's your code, you decide. Perhaps you could mark a text as gibberish if x% of its words are not in the list?
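One way to implement that percentage idea, as a sketch (is_gibberish and the 0.5 threshold are my own choices):
import nltk

english_words = set(w.lower() for w in nltk.corpus.words.words())

def is_gibberish(text, threshold=0.5):
    # Flag the text when fewer than `threshold` of its alphabetic tokens
    # appear in the nltk word list.
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    if not tokens:
        return True
    known = sum(1 for t in tokens if t in english_words)
    return known / len(tokens) < threshold

print(is_gibberish("rain a lot the packs maybe damage."))  # expected: False
print(is_gibberish("asdasdase wh qwety"))                  # expected: True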
How to determine if grammar is correct?
This is a bigger topic, and a more in-depth explanation can be found at the following link:
Checking Grammar. But the above answer will just check if words are in the nltk corpus, not whether or not the sentence is grammatically correct.
Separating good text from 'gibber' is not a trivial task, especially if you are dealing with text messages / chats (that's what it looks like to me).
A misspelled word does not make a sample unusable and even a syntactically wrong sentence should not disqualify the whole text. That's a standard you could use for newspaper texts, but not for raw, user generated content.
I would annotate a corpus in which you separate the good samples from the bad ones and train a simple classifier on it. Annotation does not have to be a big effort, since the gibberish texts are shorter than the good ones and should be easy to recognise (at least some of them). Also, you could start with a corpus of ~100 datapoints (50 good / 50 bad) and expand it once the first model is more or less working.
This is a sample code that I always use for text classification. You need to install scikit-learn and numpy though:
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of tuples of category and texts.
    Returns a tuple of a list of labels and a list of texts
    """
    random.shuffle(data)
    return zip(*data)

# Format training data
training_data = [
    ("good", "rain a lot the packs maybe damage."),
    ("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
    ("gibber", "wh. screen"),
    ("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)

# Format test set
test_data = [
    ("gibber", "an quality"),
    ("good", "<datapoint with valid text>"),
    # ...
]
test_labels, test_texts = prepare_data(test_data)

# Create feature vectors
"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels

# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)

# Test performance
X_test = vectorizer.transform(test_texts)
y_test = test_labels

# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)

# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))

# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))

# predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer
X = vectorizer.transform(data)

# Now predict the labels for the texts in 'data'
labels = clf.predict(X)

# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words about how it works: the count vectorizer tokenizes the text and creates vectors containing the counts of all words in the corpus. Based on these vectors, the classifier tries to recognise patterns to distinguish between the two categories. A text with only a few, uncommon (because misspelled) words is more likely to end up in the 'gibber' category, while a text with a lot of words that are typical of common sentences (think of all the stop words here: 'I', 'you', 'is'...) is more likely to be good text.
If this method works for you, you should also try other classifiers and use the first model to semi-automatically annotate a larger training corpus.
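For trying other classifiers, the vectorizer/classifier split above makes swapping easy; a short sketch (MultinomialNB and LinearSVC are standard scikit-learn classes, the rest reuses the variables from the script above):
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

X_train = vectorizer.fit_transform(training_texts)
X_eval = vectorizer.transform(test_texts)
for other_clf in (MultinomialNB(), LinearSVC()):
    other_clf.fit(X_train, training_labels)
    preds = other_clf.predict(X_eval)
    print(type(other_clf).__name__, metrics.accuracy_score(test_labels, preds))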
