This is for an assignment. I have ~100 publications on a dummy website. Each publication has been given an artificial score indicating how successful it is. I need to predict what affects that value.
So far, I've scraped the information that might affect the score and saved them into individual lists. The relevant part of my loop looks like:
for each publication URL:
Predictor 1 = 3.025
Predictor 2 = Journal A
Predictor 3 = 0
Response Variable = 42.5
Title = Sentence
Abstract = Paragraph
I can resolve most of that by putting predictors 1-3 and the response into a dataframe and then doing regression. The bit that is tripping me up is the title and abstract text. I can strip them of punctuation and remove stopwords, but after that I'm not sure how to actually analyse them alongside the other predictors. I was looking into doing some kind of text-similarity comparison between high-high and high-low scoring pairs and inferring from that whether the title and abstract affect the score, but I'm hoping there is a much neater method that lets me actually put that text into a predictive model too.
I currently have 5 predictors besides the text and there are ~40,000 words in total across all titles and abstracts if any of that affects what kind of method works best. Ideally I'd like to end up being able to put everything into a single predictive model, but any method that can lead me to a working solution is good.
This would be an ideal situation for using Multinomial Naive Bayes. It is a relatively simple yet quite powerful method for classifying text. If this is an introductory exercise, I'm 99% sure your prof is expecting something with NB to solve the given problem.
I would recommend a library like sklearn which should make the task almost trivial. If you're interested in the intuition behind NB, this YouTube video should serve as a good introduction.
Start off by going over some examples/blog posts; Google should provide you with countless examples. Then modify the code to fit your use case.
You could start by grouping the articles into two classes, e.g. score <= 5 = bad, score > 5 = good. A next step would be to predict more than two classes, as explained here.
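As a rough sketch of the two-class version with sklearn (assuming your scraped data ends up in a pandas DataFrame df with columns named title, abstract and score; those names and the threshold are just placeholders, not from your post):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Combine the two text fields and binarise the score into good/bad
text = df['title'] + ' ' + df['abstract']
label = (df['score'] > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(text, label, test_size=0.2, random_state=0)

# Bag-of-words counts feeding a Multinomial Naive Bayes classifier
nb = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))
From there you could add the numeric predictors in a second model, or move to a pipeline that combines both kinds of features.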
This is more of a guidance question than a technical query. I am looking to create a classification model that classifies documents based on a specific list of strings. However, as it turns out from the data, the output is more useful when you pass labelled intents/topics instead of allowing the model to guess. Based on my research so far, I found that BERTopic might be the right way to achieve this, as it allows guided topic modeling, but the caveat is that the guided topics should contain words of similar meaning (see link below).
The example below might make clearer what I want to achieve. Suppose we have a chat text extract from a conversation between a customer and a store associate; below is what the customer is asking for.
Hi, the product I purchased from your store is defective and I would like to get it replaced or refunded. Here is the receipt number and other details...
If my list of intended labels is as follows: ['defective or damaged', 'refund request', 'wrong label description', ...], we can see that the above extract qualifies as both 'defective or damaged' and 'refund request'. For the sake of simplicity, I would pick the one the model returns with the highest score, so that we only have one label per request. Now, if my data does not have these labels defined, I may be able to use zero-shot classification to "best guess" the intent from this list. However, as I understand the use of zero-shot classification, or even the guided topic modeling of BERTopic, the categories I want above may not be derived, since the individual words in those categories do not mean the same thing.
For example, in BERTopic, an intended label could be something like ["space", "launch", "orbit", "lunar"] for a "Space Related" topic, but in my case the third label would be ["wrong", "label", "description"], which would not be well suited because it would try to find all records that mention wrong address, wrong department, wrong color, etc. I am essentially looking for a combination of those three words in context. Also, those three words may not always appear together or in the same order. For example, in this sentence -
The item had description which was labelled incorrectly.
The same challenge applies to zero-shot classification, where the labels are expected to be one word or a combination of words that mean the same thing. Let me know if this clarifies the question, or if I can clarify it further.
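For reference, the zero-shot setup I am referring to would look something like this (a minimal sketch using Hugging Face's zero-shot-classification pipeline; the model choice is only illustrative):
from transformers import pipeline

classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
labels = ['defective or damaged', 'refund request', 'wrong label description']
text = ("Hi, the product I purchased from your store is defective and I would like "
        "to get it replaced or refunded. Here is the receipt number and other details...")

result = classifier(text, candidate_labels=labels)
print(result['labels'][0])  # take the highest-scoring label as the single intent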
Approach mentioned above:
https://maartengr.github.io/BERTopic/getting_started/guided/guided.html#semi-supervised-topic-modeling
I collected some product reviews from a website, written by different users, and I'm trying to find similarities between products using embeddings of the words the users wrote.
I grouped the reviews per product, so that different reviews (i.e. different authors for one product) follow one after the other in my dataframe. I have also already tokenized the reviews (and applied the other pre-processing steps). Below is a mock-up of the dataframe I have (the list of tokens per product is actually much longer, and so is the number of products):
Product      reviews_tokenized
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
However, I'm not sure which would be more effective, Doc2Vec or Word2Vec. I would initially go for Doc2Vec, since it can find similarities by taking the whole paragraph/sentence into account and identify its topic (which I'd like, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews coming from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which is giving me a quite good silhouette score (~0.7).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

SEED = 42  # any fixed seed, for reproducibility
# One TaggedDocument per product, tagged with its row index
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# Infer one vector per product, to be used for clustering
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
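The clustering step behind the silhouette score mentioned above looks roughly like this (a sketch assuming KMeans; the number of clusters is only an example):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Cluster the per-product vectors and score the clustering
kmeans = KMeans(n_clusters=4, random_state=SEED, n_init=10)
cluster_labels = kmeans.fit_predict(dtv)
print(silhouette_score(dtv, cluster_labels))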
What do you think would be the most effective method to tackle this? If something is not clear enough, please tell me.
Thank you for your help.
EDIT: the clusters I'm obtaining are shown in the attached plot (image not reproduced here).
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, that WMD gets quite costly to calculate in bulk with larger texts.
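A minimal sketch of a pairwise WMD comparison with gensim (assuming a list of tokenised reviews called tokenized_reviews, which is my placeholder name; gensim's wmdistance also needs the POT/pyemd package installed):
from gensim.models import Word2Vec

# Train word vectors on the tokenised reviews (parameters are illustrative)
w2v = Word2Vec(sentences=tokenized_reviews, vector_size=100, min_count=2, seed=1)

# Lower distance = more similar wording/concepts between the two reviews
distance = w2v.wv.wmdistance(tokenized_reviews[0], tokenized_reviews[1])
print(distance)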
I'm trying to use Python's NLTK to do some answer type classification. Essentially, I train it on a bunch of questions and then give it some unseen questions.
The issue I'm having is that it pretty much just classifies every question as whichever answer type is most common. So if there are 200 questions marked as 'people' and 150 marked as 'place', then EVERY test question is marked as having the answer type 'people'.
I know that balanced data is better, but it seems like a very tight restriction (as well as not feasible, given the amount of test data I'm using). The training data I'm using is this set of 5500 questions here:
http://cogcomp.cs.illinois.edu/Data/QA/QC/train_5500.label
And this is my Python code:
import nltk

train = []
with open('data.txt') as f:
    content = f.readlines()
for c in content:
    # Each line looks like "LABEL question text"; the label comes first
    parts = c.split(' ', 1)
    train.append((dict(q=parts[1].rstrip()), parts[0]))

test = [
    dict(q='When was the congress of Vienna?'),
    dict(q='What is the capital of Australia?'),
    dict(q='Why doesn\'t this classifier work?')
]

classifier = nltk.classify.NaiveBayesClassifier.train(train)
print(classifier.classify_many(test))
It assigns all 3 of the test questions the 'HUM:ind' class, which is the most common question in the training set. If I reduce the number of these HUM:ind questions, it then just starts saying it's the next most popular. It only takes a couple of questions of discrepancy before that answer type overpowers all the others.
Am I missing something? Am I not using the algorithm right? Is there some parameter I should change given the format of my training data? My example is pretty similar to a couple of examples I've seen online. Any help appreciated
You always get the most frequent category back because you are not giving your classifier any useful features to work with: If you have to guess with no evidence at all, the most common class is the right answer.
The classifier can only reason about feature names and feature values it has seen before. (New data consists of known features in combinations that it has not seen before.) But your code only defines one "feature", q, and the value in each case is the entire text of the question. So all test questions are unknown (and therefore indistinguishable) feature values. You can't get something for nothing.
Learn how to train a classifier (and how classification works, while you're at it), and the problem will go away.
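To illustrate the kind of feature extraction meant here (a minimal sketch, not a tuned solution, assuming the same data.txt format as in your snippet): break each question into per-word features instead of passing the whole string as a single value.
import nltk

def question_features(question):
    # One boolean feature per word, plus the leading wh-word, instead of the raw string
    words = question.lower().split()
    features = {'first-word': words[0]}
    for w in words:
        features['contains(%s)' % w] = True
    return features

train = [(question_features(parts[1].rstrip()), parts[0])
         for parts in (line.split(' ', 1) for line in open('data.txt'))]
classifier = nltk.classify.NaiveBayesClassifier.train(train)
print(classifier.classify(question_features('What is the capital of Australia?')))
Now the classifier sees individual words it has encountered in training, so unseen questions are no longer indistinguishable.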
I have this CSV file which contains comments (tweets and Facebook comments). I want to classify them into 4 categories, viz.
Pre Sales
Post Sales
Purchased
Service query
Now the problems that I'm facing are these:
There is a huge number of overlapping words between the categories, hence Naive Bayes is failing.
The tweets are only 160 characters long; what is the best way to prevent words from one category falling into another?
How should I select features that can handle both 160-character tweets and the somewhat longer Facebook comments?
Please let me know of any reference/tutorial links to follow up on this, as I'm a newbie in this field.
Thanks
I wouldn't be so quick to write off Naive Bayes. It does fine in many domains where there are lots of weak clues (as in "overlapping words"), but no absolutes. It all depends on the features you pass it. I'm guessing you are blindly passing it the usual "bag of words" features, perhaps after filtering for stopwords. Well, if that's not working, try a little harder.
A good approach is to read a couple of hundred tweets and see how you know which category you are looking at. That'll tell you what kind of things you need to distill into features. But be sure to look at lots of data, and focus on the general patterns.
An example (but note that I haven't looked at your corpus): Time expressions may be good clues on whether you are pre- or post-sale, but they take some work to detect. Create some features "past expression", "future expression", etc. (in addition to bag-of-words features), and see if that helps. Of course you'll need to figure out how to detect them first, but you don't have to be perfect: You're after anything that can help the classifier make a better guess. "Past tense" would probably be a good feature to try, too.
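A feature extractor along these lines might look like the sketch below (the word lists are purely illustrative guesses, not a real tense detector):
import re

PAST_HINTS = re.compile(r'\b(bought|purchased|received|was|were|had|yesterday)\b', re.I)
FUTURE_HINTS = re.compile(r'\b(will|going to|planning|want to|tomorrow|soon)\b', re.I)

def comment_features(text):
    # Bag-of-words plus a couple of hand-crafted clues
    features = {'word(%s)' % w: True for w in text.lower().split()}
    features['past expression'] = bool(PAST_HINTS.search(text))
    features['future expression'] = bool(FUTURE_HINTS.search(text))
    return features
These feature dicts can be fed to a Naive Bayes classifier exactly like plain bag-of-words features.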
This is going to be a complex problem.
How do you define the categories? Get as many tweets and FB posts as you can and tag them all with the correct categories to get some ground-truth data.
Then you can identify which words/phrases are best at identifying a particular category, using e.g. PCA.
Look into scikit-learn; they have tutorials for text processing and classification.
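A minimal scikit-learn starting point could look like this (assuming a DataFrame df with 'text' and 'category' columns; those names and the example query are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF over unigrams and bigrams copes with both short tweets and longer comments
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                    LogisticRegression(max_iter=1000))
clf.fit(df['text'], df['category'])  # categories: Pre Sales, Post Sales, Purchased, Service query
print(clf.predict(['Is this phone still under warranty?']))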
I am working on a simple naive bayes classifier and I had a conceptual question about it.
I know that the training set is extremely important, so I wanted to know what constitutes a good training set in the following example. Say I am classifying web pages and deciding whether they are relevant or not. The decision is based on the probabilities of certain attributes being present on the page; these are keywords that increase the relevancy of the page, namely apple, banana, and mango. The relevant/irrelevant label is given per user. Assume that a user is equally likely to mark a page relevant or irrelevant.
Now for the training data, to get the best training for my classifier, would I need to have the same number of relevant results as irrelevant results? Do I need to make sure that each user would have relevant/irrelevant results present for them to make a good training set? What do I need to keep in mind?
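For reference, here is a minimal sketch (scikit-learn names; the keyword matrix is made-up data) of how the class mix enters the model through the learned prior:
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: presence of the keywords apple, banana, mango (made-up examples)
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y = np.array(['relevant', 'relevant', 'relevant', 'irrelevant'])  # 3:1 imbalance

nb = BernoulliNB()                                  # prior learned from the 3:1 split
nb.fit(X, y)
nb_uniform = BernoulliNB(class_prior=[0.5, 0.5])    # or force equal priors instead
nb_uniform.fit(X, y)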
This is a somewhat endless topic, as there are millions of factors involved. Python is a good example, as it drives much of Google (as far as I know). And this brings us to the very beginnings of Google: there was an interview with Larry Page some years ago in which he spoke about the search engines before Google. For example, when he typed the word "university", the first result he found simply had the word "university" a few times in its title.
Going back to Naive Bayes classifiers, there are a few very important factors: assumptions and pattern recognition. And relations, of course. Take the apples you mentioned; that could go a few ways. For example:
Apple - if eating, vitamins, or shape is present, we assume that we are most likely talking about a fruit.
If we are mentioning electronics, screens, maybe Steve Jobs - that should be obvious.
If we are talking about religion, God, gardens, snakes - then it must have something to do with Adam and Eve.
So depending on your needs, you could have basic segments of data into which each one of these falls, or a complex structure containing far more detail. So yes, you base most of this on plain assumptions. And based on those, you can create more complex patterns for further recognition: Apple, iPod and iPad have similar patterns in their names, contain similar keywords, and mention certain people, so they are most likely related to each other.
Irrelevant data is very hard to spot. At this point you are probably thinking that I own multiple Apple devices and am writing this on a large iMac, while that couldn't be further from the truth; it would be a very wrong assumption to begin with. So the classifiers themselves must do very good segmentation and analysis before jumping to firm conclusions.