Naive Bayes classifier not working for sentiment analysis - python

I am trying to train a Naive Bayes classifier to predict whether a movie review is good or bad.
I am following this tutorial but have run into an error when trying to train the model:
https://medium.com/#MarynaL/analyzing-movie-review-data-with-natural-language-processing-7c5cba6ed922
I have followed all the steps up to training the model. My data and code look as follows:
Reviews Labels
0 For fans of Chris Farley, this is probably his... 1
1 Fantastic, Madonna at her finest, the film is ... 1
2 From a perspective that it is possible to make... 1
3 What is often neglected about Harold Lloyd is ... 1
4 You'll either love or hate movies such as this... 1
... ...
14995 This is perhaps the worst movie I have ever se... 0
14996 I was so looking forward to seeing this film t... 0
14997 It pains me to see an awesome movie turn into ... 0
14998 "Grande Ecole" is not an artful exploration of... 0
14999 I felt like I was watching an example of how n... 0
gnb = MultinomialNB()
gnb.fit(all_train_set['Reviews'], all_train_set['Labels'])
However when trying to fit the model I receive this error:
ValueError: could not convert string to float: 'For fans of Chris Farley, this is probably his best film. David Spade pl
If anyone could help me figure out why following this tutorial has gone wrong, it would be greatly appreciated.
Many thanks

Indeed, with scikit-learn you have to convert the texts to numbers before fitting a classifier. You can do so with, for instance, the CountVectorizer or the TfidfVectorizer, as in the sketch below.
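A minimal sketch with CountVectorizer (column names taken from your snippet; bag-of-words counts are exactly what MultinomialNB expects):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# turn the raw review strings into a sparse matrix of token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(all_train_set['Reviews'])
y = all_train_set['Labels']

gnb = MultinomialNB()
gnb.fit(X, y)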
If you want to use the more modern word embeddings, you can use the Zeugma package (install it with pip install zeugma in a terminal), e.g.
from zeugma.embeddings import EmbeddingTransformer
from sklearn.naive_bayes import MultinomialNB  # import needed for the classifier below

embedding = EmbeddingTransformer('glove')
X = embedding.transform(all_train_set['Reviews'])
y = all_train_set['Labels']

gnb = MultinomialNB()
gnb.fit(X, y)
I hope it helps!

Related

How to list all documents/words per topic in bert topic modelling?

I read the docs, but I can see that the topics only show 3 or 4 documents per topic whereas the count is 2000+. Is there a way I can see all the assigned documents instead of just three or four documents per topic?
For example: I want to see all the 2555 documents shown in the picture below, and get all the words under the name column, not just the first 3 or 4 words. I have tried many things, but it doesn't work.
As I understand it, you want to see the n_words of the model and also the documents representing specific topics. First of all, you can list all topics with the following code:
import pandas as pd

# freq1 is assumed to be the topic overview, e.g. from model1.get_topic_info()
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3):
    print(freq1)
With this, you will see all the topics in your model.
In order to get n_words from topic 3 you can run this command:
model1.get_topic(3)
and you will get
[('hood', 0.08646080854070591),
('fort', 0.07903592661513956),
('terrorist', 0.04050269806508548),
('muslim', 0.0404965762204116),
('ft', 0.04046989026050265),
('militari', 0.03581303765985982),
('armi', 0.025703775870144486),
('base', 0.025620172464129863),
('islam', 0.024491265378088094),
('attack', 0.02280540444895898)]
That is, the topic's n_words together with their c-TF-IDF scores.
You can also get all topics with their n_words by running:
model1.get_topics()
If you want to get the documents that represent a topic, you can run:
model1.get_representative_docs(3)
where the output will be like:
['wouldnt fort hood islam terror attack radic jihadist',
'wasnt fort hood consid terrorist attack',
'kind gun use fort hood']
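If instead you want every document assigned to a given topic (not just the representative ones), one option is to pair each document with the topic id it was assigned during fitting. A minimal sketch, assuming docs is your original list of documents and topics is the list returned by model1.fit_transform(docs):
import pandas as pd

# one topic id per document, in the same order as docs
doc_topics = pd.DataFrame({"Document": docs, "Topic": topics})

# all documents assigned to topic 3
print(doc_topics[doc_topics["Topic"] == 3])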

Is it necessary to re-train BERT models, specifically RoBERTa model?

I am looking for sentiment analysis code with at least 80% accuracy. I tried VADER and found it easy and usable, however it was only giving 64% accuracy.
Now, I was looking at some BERT models and I noticed they need to be re-trained? Is that correct? Aren't they pre-trained? Is re-training necessary?
You can use pre-trained models from HuggingFace. There are plenty to choose from; search for emotion or sentiment models.
Here is an example of a model with 26 emotions. The current implementation works but is very slow for large datasets.
import pandas as pd
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification, pipeline

tokenizer = RobertaTokenizerFast.from_pretrained("arpanghoshal/EmoRoBERTa")
model = TFRobertaForSequenceClassification.from_pretrained("arpanghoshal/EmoRoBERTa")

emotion = pipeline('sentiment-analysis',
                   model='arpanghoshal/EmoRoBERTa')

# example data
DATA_URI = "https://github.com/AFAgarap/ecommerce-reviews-analysis/raw/master/Womens%20Clothing%20E-Commerce%20Reviews.csv"
dataf = pd.read_csv(DATA_URI, usecols=["Review Text"])

# This is super slow, I will find a better optimization ASAP
dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: (d["Review Text"]
                                    .fillna("")
                                    .map(lambda x: emotion(x)[0].get("label", None))))
         )
We could also refactor it a bit
...
# a bit faster than the previous but still slow
def emotion_func(text: str) -> str:
    if not text:
        return None
    return emotion(text)[0].get("label", None)

dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: d["Review Text"].map(emotion_func))
         )
Results:
Review Text Emotion
0 Absolutely wonderful - silky and sexy and comf... admiration
1 Love this dress! it's sooo pretty. i happene... love
2 I had such high hopes for this dress and reall... fear
3 I love, love, love this jumpsuit. it's fun, fl... love
...
6 I aded this in my basket at hte last mintue to... admiration
7 I ordered this in carbon for store pick up, an... neutral
8 I love this dress. i usually get an xs but it ... love
9 I'm 5"5' and 125 lbs. i ordered the s petite t... love
...
16 Material and color is nice. the leg opening i... neutral
17 Took a chance on this blouse and so glad i did... admiration
...
26 I have been waiting for this sweater coat to s... excitement
27 The colors weren't what i expected either. the... disapproval
...
31 I never would have given these pants a second ... love
32 These pants are even better in person. the onl... disapproval
33 I ordered this 3 months ago, and it finally ca... disappointment
34 This is such a neat dress. the color is great ... admiration
35 Wouldn't have given them a second look but tri... love
36 This is a comfortable skirt that can span seas... approval
...
40 Pretty and unique. great with jeans or i have ... admiration
41 This is a beautiful top. it's unique and not s... admiration
42 This poncho is so cute i love the plaid check ... love
43 First, this is thermal ,so naturally i didn't ... love
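A possible speed-up (an assumption on my side, not benchmarked here): transformers pipelines also accept a list of texts, so you can pass the whole column in one call instead of mapping row by row, e.g.
# pass all reviews to the pipeline at once
texts = dataf["Review Text"].fillna("").tolist()
predictions = emotion(texts, truncation=True)  # list of {"label": ..., "score": ...}
dataf["Emotion"] = [p["label"] for p in predictions]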
You can also use pickle. Pickle lets you... well, pickle your model for later use, and in fact you can use a loop to keep training the model until it reaches a certain accuracy, then exit the loop and pickle the model for later use.
You can find many tutorials on YouTube on how to pickle a model.
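A minimal sketch of saving and reloading a trained model with the standard library (the file name is just an example):
import pickle

# save the trained model to disk
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump(model, f)

# load it back later without re-training
with open("sentiment_model.pkl", "rb") as f:
    model = pickle.load(f)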

TypeError in torch.argmax() when want to find the tokens with the highest `start` score

I want to run this code for question answering using hugging face transformers.
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
#Model
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
#Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
question = '''Why was the student group called "the Methodists?"'''
paragraph = ''' The movement which would become The United Methodist Church began in the mid-18th century within the Church of England.
A small group of students, including John Wesley, Charles Wesley and George Whitefield, met on the Oxford University campus.
They focused on Bible study, methodical study of scripture and living a holy life.
Other students mocked them, saying they were the "Holy Club" and "the Methodists", being methodical and exceptionally detailed in their Bible study, opinions and disciplined lifestyle.
Eventually, the so-called Methodists started individual societies or classes for members of the Church of England who wanted to live a more religious life. '''
encoding = tokenizer.encode_plus(text=question,text_pair=paragraph)
inputs = encoding['input_ids'] #Token embeddings
sentence_embedding = encoding['token_type_ids'] #Segment embeddings
tokens = tokenizer.convert_ids_to_tokens(inputs) #input tokens
start_scores, end_scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(start_scores)
but I get this error at the last line:
Exception has occurred: TypeError
argmax(): argument 'input' (position 1) must be Tensor, not str
File "D:\bert\QuestionAnswering.py", line 33, in <module>
start_index = torch.argmax(start_scores)
I don't know what's wrong. Can anyone help me?
BertForQuestionAnswering returns a QuestionAnsweringModelOutput object.
Since you unpack the output of BertForQuestionAnswering into start_scores, end_scores, the returned QuestionAnsweringModelOutput is iterated over its keys, so start_scores ends up holding the string 'start_logits' (and end_scores 'end_logits'), which causes the type mismatch error.
The following should work:
outputs = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
start_index = torch.argmax(outputs.start_logits)
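For completeness, a small sketch of going from those logits to the answer text (standard extractive-QA post-processing, not part of the original answer):
end_index = torch.argmax(outputs.end_logits)

# join the predicted token span back into a readable string
answer_tokens = tokens[int(start_index): int(end_index) + 1]
print(tokenizer.convert_tokens_to_string(answer_tokens))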
Huggingface transformers provide a simple high-level way of running the model, as shown in this guide:
from transformers import pipeline
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)
print(nlp(question=question, context=paragraph, topk=5))
topk lets you select several top-scoring answers.

How can I find the probability of a sentence using GPT-2?

I'm trying to write a program that, given a list of sentences, returns the most probable one. I want to use GPT-2, but I am quite new to using it (as in, I don't really know how to do it). I'm planning on finding the probability of a word given the previous words and multiplying all the probabilities together to get the overall probability of that sentence occurring; however, I don't know how to find the probability of a word occurring given the previous words. This is my (pseudo) code:
sentences = # my list of sentences
max_prob = 0
best_sentence = sentences[0]

for sentence in sentences:
    prob = 1  # probability of that sentence
    for idx, word in enumerate(sentence.split()[1:]):
        prob *= probability(word, " ".join(sentence[:idx]))  # this is where I need help
    if prob > max_prob:
        max_prob = prob
        best_sentence = sentence

print(best_sentence)
Can I have some help please?
You can also try lm-scorer, a tiny wrapper around transformers that allows you to get sentence probabilities using models that support it (only GPT2 models are implemented at the time of writing).
https://github.com/simonepri/lm-scorer
I just used it myself and it works perfectly.
Warning: If you use other transformers / pipelines in the same environment, things may get messy.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(tokens_tensor):
    # the model returns the mean cross-entropy loss when labels are provided;
    # exponentiating it gives the perplexity of the sentence
    loss = model(tokens_tensor, labels=tokens_tensor)[0]
    return np.exp(loss.cpu().detach().numpy())

texts = ['i would like to thank you mr chairman',
         'i would liking to thanks you mr chair in',
         'thnks chair']
for text in texts:
    tokens_tensor = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
    print(text, score(tokens_tensor))
This code snippet could be an example of what you are looking for. You feed the model a list of sentences, and it scores each one; the lower the score, the better.
The output of the code above is:
i would like to thank you mr chairman 122.3066
i would liking to thanks you mr chair in 1183.7637
thnks chair 14135.129
I wrote a set of functions that can do precisely what you're looking for. Recall that GPT-2 parses its input into tokens (not words): the last word in 'Joe flicked the grasshopper' is actually three tokens: ' grass', 'ho', and 'pper'. The cloze_finalword function takes this into account, and computes the probabilities of all tokens (conditioned on the tokens appearing before them). You can adapt part of this function so that it returns what you're looking for. I hope you find the code useful!
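The linked functions aren't reproduced here, but as a rough sketch of the same idea (this is an illustration, not the original cloze_finalword), the conditional probability of each token given the tokens before it can be read off GPT-2's logits:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text):
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits
    # log-probabilities over the vocabulary at every position except the last
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # pick out the log-probability of each actual next token
    positions = torch.arange(input_ids.size(1) - 1)
    token_log_probs = log_probs[positions, input_ids[0, 1:]]
    return token_log_probs.sum().item()

print(sentence_logprob("Joe flicked the grasshopper"))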
I think GPT-2 is a bit overkill for what you're trying to achieve. You can build a basic language model which will give you sentence probability using NLTK. A tutorial for this can be found here.
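As a rough sketch of that NLTK idea (a toy bigram model with nltk.lm; the training corpus here is just a placeholder):
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# toy corpus: a list of tokenized sentences (placeholder data)
corpus = [["i", "like", "trains"], ["i", "like", "cats"]]

n = 2  # bigram model
train_data, vocab = padded_everygram_pipeline(n, corpus)
lm = MLE(n)
lm.fit(train_data, vocab)

# probability of "like" given the previous word "i"
print(lm.score("like", ["i"]))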

Naive Bayes classifier extractive summary

I am trying to train a naive Bayes classifier and I am having trouble with the data. I plan to use it for extractive text summarization.
Example_Input: It was a sunny day. The weather was nice and the birds were singing.
Example_Output: The weather was nice and the birds were singing.
I have a dataset that I plan to use, and in every document there is at least one sentence that belongs in the summary.
I decided to use sklearn, but I don't know how to represent the data that I have, namely X and y.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X, y)
The closest I can think of is to make it like this:
X = [
    'It was a sunny day. The weather was nice and the birds were singing.',
    'I like trains. Hi, again.'
]
y = [
    [0, 1],
    [1, 0]
]
where the target values mean 1 - included in the summary and 0 - not included. Unfortunately this gives a bad-shape exception because y is expected to be a 1-d array. I cannot think of a way to represent it as such, so please help.
Btw, I don't use the string values in X directly but represent them as vectors with CountVectorizer and TfidfTransformer from sklearn.
As per your requirement, you are classifying sentences. That means you need to separate each sentence and predict its class individually.
For example, instead of using:
X = [
    'It was a sunny day. The weather was nice and the birds were singing.',
    'I like trains. Hi, again.'
]
Use it as follows:
X = [
    'It was a sunny day.',
    'The weather was nice and the birds were singing.',
    'I like trains.',
    'Hi, again.'
]
Use NLTK's sentence tokenizer to achieve this, for example:
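A small sketch (nltk.download('punkt') may be needed the first time):
import nltk

# nltk.download('punkt')  # uncomment on first use
text = 'It was a sunny day. The weather was nice and the birds were singing.'
sentences = nltk.sent_tokenize(text)
print(sentences)
# ['It was a sunny day.', 'The weather was nice and the birds were singing.']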
Now, for the labels, use two classes, say 1 for "included in the summary" and 0 for "not included". Since scikit-learn expects y to be a 1-d array of labels, it should look like:
y = [0, 1, 1, 0]
Now, use this data to fit and predict the way you want!
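Putting it together, a minimal sketch (using CountVectorizer, as you mentioned, to turn the sentences into vectors):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)  # X is the list of single sentences above
clf = MultinomialNB().fit(X_vec, y)

# predict whether a new sentence should go into the summary
print(clf.predict(vectorizer.transform(['The birds were singing.'])))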
Hope it helps!
