Naive Bayesian Classification (using NLTK) - python

I'm trying to use Python's NLTK to do some answer-type classification: essentially, train it on a bunch of questions and then give it some unseen questions.
The issue I'm having is that it pretty much just classifies every question as whichever answer type is most common. So if there are 200 questions marked as 'people' and 150 marked as 'place', then EVERY test question is marked as having the answer type 'people'.
I know that balanced data is better, but it seems like a very tight restriction (as well as not feasible, given the amount of test data I'm using). The training data I'm using is this set of 5500 questions here:
http://cogcomp.cs.illinois.edu/Data/QA/QC/train_5500.label
And this is my Python code:
import nltk

train = []
with open('data.txt') as f:
    content = f.readlines()
for c in content:
    parts = c.split(' ', 1)
    train.append((dict(q=parts[1].rstrip()), parts[0]))

test = [
    dict(q='When was the congress of Vienna?'),
    dict(q='What is the capital of Australia?'),
    dict(q='Why doesn\'t this classifier work?'),
]

classifier = nltk.classify.NaiveBayesClassifier.train(train)
print(classifier.classify_many(test))
It assigns all 3 of the test questions the 'HUM:ind' class, which is the most common question type in the training set. If I reduce the number of HUM:ind questions, it just starts predicting the next most popular type instead. It only takes a discrepancy of a couple of questions before one answer type overpowers all the others.
Am I missing something? Am I not using the algorithm right? Is there some parameter I should change given the format of my training data? My example is pretty similar to a couple of examples I've seen online. Any help appreciated.

You always get the most frequent category back because you are not giving your classifier any useful features to work with: If you have to guess with no evidence at all, the most common class is the right answer.
The classifier can only reason about feature names and feature values it has seen before. (New data consists of known features in combinations that it has not seen before.) But your code only defines one "feature", q, and the value in each case is the entire text of the question. So all test questions are unknown (and therefore indistinguishable) feature values. You can't get something for nothing.
Learn how to train a classifier (and how classification works, while you're at it), and the problem will go away.
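As a minimal sketch of that kind of feature extraction (not the only reasonable choice: here each question is reduced to word-presence features plus its first word, and NLTK's 'punkt' tokenizer data is assumed to be downloaded):

import nltk

def question_features(text):
    # Features the classifier can recognize again in unseen questions:
    # the leading wh-word plus the presence of each (lowercased) word.
    words = nltk.word_tokenize(text.lower())
    features = {'first_word': words[0]}
    for w in words:
        features['contains({})'.format(w)] = True
    return features

train = []
with open('data.txt') as f:  # same TREC-format file as above
    for line in f:
        label, question = line.split(' ', 1)
        train.append((question_features(question.rstrip()), label))

classifier = nltk.classify.NaiveBayesClassifier.train(train)

test_questions = ['When was the congress of Vienna?',
                  'What is the capital of Australia?']
print(classifier.classify_many([question_features(q) for q in test_questions]))

With features like these, two questions that share words (or a wh-word) are no longer indistinguishable, so the classifier has something other than the class prior to go on.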

Related

Predicting score from text data and non-text data

This is for an assignment. I have ~100 publications on a dummy website. Each publication has been given an artificial score indicating how successful it is. I need to predict what affects that value.
So far, I've scraped the information that might affect the score and saved it into individual lists. The relevant part of my loop looks like:
for publicationurl:
    Predictor 1 = 3.025
    Predictor 2 = Journal A
    Predictor 3 = 0
    Response Variable = 42.5
    Title = Sentence
    Abstract = Paragraph
I can resolve most of that by putting predictors 1-3 and the response into a dataframe and then doing regression. The bit that is tripping me up is the Title and Abstract text. I can strip them of punctuation and remove stopwords, but after that I'm not sure how to actually analyse them alongside the other predictors. I was looking into doing some kind of text-similarity comparison between high-high and high-low scoring pairs and judging from that whether the title and abstract affect the score, but I'm hoping there is a much neater method that lets me actually put that text into a predictive model too.
I currently have 5 predictors besides the text and there are ~40,000 words in total across all titles and abstracts if any of that affects what kind of method works best. Ideally I'd like to end up being able to put everything into a single predictive model, but any method that can lead me to a working solution is good.
This would be an ideal situation for using Multinomial Naive Bayes. It is a relatively simple yet quite powerful method for predicting the class of a text. If this is an introductory exercise, I'm 99% sure your prof is expecting something with NB to solve the given problem.
I would recommend a library like sklearn which should make the task almost trivial. If you're interested in the intuition behind NB, this YouTube video should serve as a good introduction.
Start off by going over some examples/blog posts; Google should provide you with countless examples. Then modify the code to fit your use case.
You could start by grouping the articles into two classes, e.g. score <= 5 = bad, score > 5 = good. A next step would be to predict more than two classes, as explained here.
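As a rough sketch of that two-class setup with sklearn (the file name, column names and the score threshold below are placeholders for whatever your scraped data actually uses):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical table of scraped publications; adjust names to your data.
df = pd.read_csv('publications.csv')

# Binarize the artificial score into "good" vs "bad", as suggested above.
y = (df['score'] > 5).astype(int)
text = df['title'] + ' ' + df['abstract']   # treat title + abstract as one document

X_train, X_test, y_train, y_test = train_test_split(text, y, test_size=0.2, random_state=0)

vectorizer = CountVectorizer(stop_words='english')
X_train_bow = vectorizer.fit_transform(X_train)   # bag-of-words counts
X_test_bow = vectorizer.transform(X_test)

clf = MultinomialNB().fit(X_train_bow, y_train)
print(clf.score(X_test_bow, y_test))              # accuracy on the held-out 20%

Once that works, one option is to concatenate the other numeric predictors onto the text features (e.g. with scipy.sparse.hstack) or to combine two models, so that everything feeds a single prediction.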

What should be used between Doc2Vec and Word2Vec when analyzing product reviews?

I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped the reviews per product, so that different reviews succeed one another in my dataframe (i.e. different authors for one product). Furthermore, I have also already tokenized the reviews (and applied all the other pre-processing methods). Below is a mock-up dataframe of what I have (the number of tokens per product is actually very high, as is the number of products):
Product      reviews_tokenized
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
However, I'm not sure which of Doc2Vec and Word2Vec would be the more effective. I would initially go for Doc2Vec, since it can find similarities by taking the whole paragraph/sentence into account and find its topic (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews being from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which gives me a quite good silhouette score (~0.7).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

SEED = 42  # placeholder; any fixed seed, for reproducibility
# One TaggedDocument per product, tagged with its row index
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# One inferred vector per product, for downstream clustering
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, or else, please tell me.
Thank you for your help.
EDIT: Below are the clusters I'm obtaining:
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
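For reference, a rough sketch of a pairwise WMD comparison with gensim (the token lists and the choice of pretrained vectors are placeholders; WMD also needs an extra dependency, pyemd for gensim 3.x or POT for gensim 4.x):

import gensim.downloader as api

# Any word-vector set works; this small GloVe model is just easy to download.
wv = api.load('glove-wiki-gigaword-50')

review_a = ['absolutely', 'amazing', 'simulator']
review_b = ['fantastic', 'simulation', 'experience']

distance = wv.wmdistance(review_a, review_b)   # lower = more similar wording
print(distance)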

Decide whether a text is about "Topic A" or not - NLP with Python

I'm trying to write a Python program that will decide if a given post is about the topic of volunteering. My datasets are small (only the posts, which are examined one by one), so approaches like LDA do not yield results.
My end goal is a simple True/False, a post is about the topic or not.
I'm trying this approach:
1. Using Google's word2vec model, I create a "cluster" of words that are similar to the word "volunteer":
   CLUSTER = [x[0] for x in MODEL.most_similar_cosmul("volunteer", topn=120)]
2. Getting the posts and translating them to English, using Google Translate.
3. Cleaning the translated posts using NLTK (removing stopwords and punctuation, and lemmatizing the posts).
4. Making a BOW out of the translated, clean post.
The next stage is difficult for me: I want to calculate a "distance" / "similarity" / something that will help me get the True/False answer that I'm looking for, but I can't think of a good way to do that.
Thank you for your suggestions and help in advance.
You are attempting to intuitively improvise a set of steps that, in the end, will classify these posts into the two categories, "volunteering" and "not-volunteering".
You should look for online examples of "text classification" that are similar to your task, work through them (with their original demo data) for understanding, then adapt them incrementally to work with your data instead.
At some point, word2vec might be a helpful contributor to your task - but I wouldn't start with it. Similarly, eliminating stop-words, performing lemmatization, etc might eventually be helpful, but need not be important up front.
You'll typically want to start by acquiring (by hand-labeling if necessary) a training set of text for which you know the "volunteering" or "not-volunteering" value (known labels).
Then, create some feature vectors for the texts. A simple starting approach that offers a quick baseline for later improvements is a "bag of words" representation.
Then, feed those representations, with the known labels, to some existing classification algorithm. The popular scikit-learn package in Python offers many. That is: you don't yet need to worry about choosing ways to calculate a "distance" / "similarity" / something to guide your own ad hoc classifier. Just feed the labeled data into one (or many) existing classifiers, and check how well they're doing. Many will be using various kinds of similarity/distance calculations internally, but that happens automatically as a result of choosing & configuring the algorithm.
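As a bare-bones sketch of that baseline (the example posts and labels below are made up purely for illustration; any scikit-learn classifier could stand in for logistic regression):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled training data: (post text, known label) pairs.
posts = ['we are looking for volunteers to help at the shelter',
         'the city council approved the new parking rules',
         'join us this weekend to volunteer at the food bank',
         'ticket prices for the concert go on sale tomorrow']
labels = ['volunteering', 'not-volunteering', 'volunteering', 'not-volunteering']

# Bag-of-words (tf-idf weighted) features fed into an off-the-shelf classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(posts, labels)

print(clf.predict(['volunteers needed for the beach cleanup']))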
Finally, when you have something working start-to-finish, no matter how modest the results, then try alternate ways of preprocessing text (stop-word removal, lemmatization, etc.), featurizing text, and alternate classifiers/algorithm parameterizations, to compare results and thus discover what works well given your specific data, goals, and practical constraints.
The scikit-learn "Working With Text Data" guide is worth reviewing & working-through, and their "Choosing the right estimator" map is useful for understanding the broad terrain of alternate techniques and major algorithms, and when different ones apply to your task.
Also, scikit-learn contributors/educators like Jake Vanderplas (github.com/jakevdp) and Olivier Grisel (github.com/ogrisel) have many online notebooks/tutorials/archived-video-presentations which step through all the basics, often including text-classification problems much like yours.

Parameter values of Doc2vec for Document Tagging - Gensim

My task is to assign tags (descriptive words) to documents or posts from a list of available tags. I'm working with Doc2Vec, available in Gensim. I read that Doc2Vec can be used for document tagging, but I could not find suitable parameter values for this task. So far, I have tested it by changing the values of the parameters named 'size' and 'window'. The results I'm getting are mostly nonsense, and by changing the values of these parameters I haven't found any trend in the results, i.e. at some values the results improve a little and at other values they fall off. Can anyone suggest suitable parameter values for this task? I found that 'size' (which defines the size of the feature vector) should be large if we have enough training data, but about the rest of the parameters I'm not sure!
Which parameters are best can vary with the quality & size of your training data, and exactly what your downstream goals are. (There's no one set of best-for-everything parameters.)
Starting with the gensim defaults is a reasonable first guess, as are other values you've seen someone else use successfully on a similar dataset/problem.
But really you'll need to experiment, ideally by creating an automated evaluation based on some held-back testing set, then meta-optimizing the Doc2Vec parameters by searching over many small adjustments to the parameters for the best ranges/combinations.
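For example, a minimal sketch of that kind of search (here train_docs is assumed to be your list of TaggedDocument objects and evaluate() is a scoring function you write against your held-back tags; current gensim names the parameters vector_size and window, where older releases used size):

from gensim.models.doc2vec import Doc2Vec

best_score, best_params = None, None
for vector_size in (50, 100, 200):
    for window in (3, 5, 8):
        # Train one model per parameter combination on the same corpus.
        model = Doc2Vec(train_docs, vector_size=vector_size, window=window,
                        min_count=2, epochs=20)
        score = evaluate(model)          # your own metric: higher = better
        if best_score is None or score > best_score:
            best_score, best_params = score, (vector_size, window)

print(best_params, best_score)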

Bayes Classifier Training set

I am working on a simple naive bayes classifier and I had a conceptual question about it.
I know that the training set is extremely important, so I wanted to know what constitutes a good training set in the following example. Say I am classifying web pages and concluding whether they are relevant or not. The factors on which this decision is based take into account the probabilities of certain attributes being present on that page; these would be certain keywords that increase the relevancy of the page. The keywords are apple, banana, mango. The relevant/irrelevant score is per user. Assume that a user is equally likely to mark a page relevant or irrelevant.
Now for the training data, to get the best training for my classifier, would I need to have the same number of relevant results as irrelevant results? Do I need to make sure that each user would have relevant/irrelevant results present for them to make a good training set? What do I need to keep in mind?
This is a slightly endless topic, as there are millions of factors involved. Python is a good example, as it drives most of Google (from what I know). And this brings us to the very beginning of Google: there was an interview with Larry Page some years ago in which he spoke about the search engines that came before Google. For example, when he typed the word "university", the first result he found had the word "university" a few times in its title.
Going back to naive Bayes classifiers, there are a few very important key factors: assumptions and pattern recognition. And relations, of course. For example, you mentioned apples; that could have a few possibilities:
Apple: if eating, vitamins, and shape are present, we assume that we are most likely talking about a fruit.
If we are mentioning electronics, screens, maybe Steve Jobs: that should be obvious.
If we are talking about religion, God, gardens, snakes: then it must have something to do with Adam and Eve.
So depending on your needs, you could have basic segments of data that each of these falls into, or a complex structure containing far more detail. So yes, you base most of those on plain assumptions. And based on those, you can create more complex patterns for further recognition: Apple, iPod and iPad have similar patterns in their names, contain similar keywords, and mention certain people, so they are most likely related to each other.
Irrelevant data is very hard to spot. At this very point you are probably thinking that I own multiple Apple devices and am writing on a large iMac, while this couldn't be further from the truth; that would be a very wrong assumption to begin with. So the classifiers themselves must do a very good job of segmentation and analysis before jumping to exact conclusions.
