I am doing sentiment analysis on Twitter data using Python NLTK. I need a dictionary that contains the positive and negative polarities of words. I have read a lot about SentiWordNet, but when I use it in my project the results are slow and not very good. I think I'm not using it correctly. Can anyone tell me the correct way to use it? Here are the steps I have done so far:
tokenization of tweets
POS tagging of tokens
passing each tagged token to SentiWordNet
I am using the nltk package for tokenization and tagging. See a part of my code below:
import nltk
from nltk.stem import *
from nltk.corpus import sentiwordnet as swn

tokens = nltk.word_tokenize(row)   # for tokenization; row is a line of the file in which tweets are saved
tagged = nltk.pos_tag(tokens)      # for POS tagging

for i in range(0, len(tagged)):
    if 'NN' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'n')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'n'))[0]).pos_score()  # positive score of a word
        nscore += (list(swn.senti_synsets(tagged[i][0], 'n'))[0]).neg_score()  # negative score of a word
    elif 'VB' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'v')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'v'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'v'))[0]).neg_score()
    elif 'JJ' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'a')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'a'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'a'))[0]).neg_score()
    elif 'RB' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'r')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'r'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'r'))[0]).neg_score()
At the end I will calculate how many tweets are positive and how many are negative.
Where am I going wrong? How should I use it? And is there any other similar dictionary that is easier to use?
Yes, there are other lexicons that you can use. You can find a small list of lexicons here: http://sentiment.christopherpotts.net/lexicons.html#resources
It seems Bing Liu's Opinion Lexicon is quite easy to use.
Apart from linking to those lexicons, that website is also a very nice tutorial on sentiment analysis.
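If you want to try that route, NLTK actually ships Bing Liu's opinion lexicon as a corpus (nltk.corpus.opinion_lexicon), so a simple word-counting scorer is easy to put together. A minimal sketch; the hit-counting rule here is only an illustration, and you need to run nltk.download('opinion_lexicon') once:

from nltk.corpus import opinion_lexicon
from nltk.tokenize import TweetTokenizer

pos_words = set(opinion_lexicon.positive())   # sets make membership checks fast
neg_words = set(opinion_lexicon.negative())

def liu_polarity(tweet):
    tokens = TweetTokenizer().tokenize(tweet.lower())
    score = sum((t in pos_words) - (t in neg_words) for t in tokens)
    if score > 0:
        return 'Pos'
    elif score < 0:
        return 'Neg'
    return 'Neu'

print(liu_polarity("I love this phone, the battery is great"))   # likely 'Pos'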
To calculate the sentiment of a whole comment with SentiWordNet itself, you can average the synset scores of its words. Here is the idea wrapped into a function; it expects a list of (token, wordnet_pos) pairs, where the POS is one of 'n', 'v', 'a', 'r':

from nltk.corpus import sentiwordnet as swn

def comment_sentiment(tagged_words):
    total_score = 0.0
    count_words_included = 0
    for word, pos in tagged_words:
        synset_forms = list(swn.senti_synsets(word, pos))
        if not synset_forms:         # word not in SentiWordNet, skip it
            continue
        synset = synset_forms[0]     # take the most common sense
        total_score += synset.pos_score() - synset.neg_score()
        count_words_included += 1
    if count_words_included == 0:
        return 'N/A'
    elif total_score == 0:
        return 'Neu'
    elif total_score / count_words_included < 0:
        return 'Neg'
    else:
        return 'Pos'
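A possible way to call it on one of your tweets, mapping the Penn Treebank tags returned by nltk.pos_tag to the WordNet tags the function expects (the penn_to_wn dictionary is my own shorthand, not something NLTK provides under that name):

import nltk

penn_to_wn = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}   # hypothetical helper mapping

tokens = nltk.word_tokenize("great phone but the battery dies quickly")
tagged = [(word, penn_to_wn[tag[:2]])
          for word, tag in nltk.pos_tag(tokens)
          if tag[:2] in penn_to_wn]
print(comment_sentiment(tagged))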
I am doing sentiment analysis on Twitter data in Python to judge whether tweets are negative, positive, or neutral, with this code:
hasilAnalisis = []
for tweets in hasilUser:
    tweets_properties = {}
    tweets_properties['tanggal_tweet'] = tweets.created_at
    tweets_properties['pengguna'] = tweets.user.screen_name
    tweets_properties['isi_tweet'] = tweets.text

    tweets_full_cleansing = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweets.text).split())
    analysis = TextBlob(tweets_full_cleansing)

    try:
        analysis = analysis.translate(to='en')
    except Exception as e:
        print(e)

    if analysis.sentiment.polarity > 0.0:
        tweets_properties["sentimen"] = "positif"
    elif analysis.sentiment.polarity == 0.0:
        tweets_properties["sentimen"] = "Netral"
    else:
        tweets_properties["sentimen"] = "Negatif"

    if tweets.retweet_count > 0:
        if tweets_properties not in hasilAnalisis:
            hasilAnalisis.append(tweets_properties)
    else:
        hasilAnalisis.append(tweets_properties)
The problem is that the tweets I want to analyze are in Indonesian, so I have to translate them into English first, using this code:
try:
    analysis = analysis.translate(to='en')
except Exception as e:
    print(e)
After that the tweet can be scored, because it has already been converted to English and analysis.sentiment.polarity works.
Is there any way to do this without translating into English first?
Based on this https://ksnugroho.medium.com/dasar-text-preprocessing-dengan-python-a4fa52608ffe
I can use Sastrawi for tokenization, but I have no idea how to get sentiment polarity for the Indonesian language.
I looked into your question and found a method for this. You can use:
Sentiment Lexicon for Indonesian in this GitHub link
A sentiment lexicon is usually available for every language.
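As a rough illustration of how you could use such a lexicon once you have downloaded it, you can load it into a dict and score the cleaned tweet yourself instead of translating and going through TextBlob. The file name and column layout below are only assumptions; adjust them to whatever the lexicon you pick actually uses:

import csv

# Assumed format: a tab-separated file with "word" and "weight" columns (e.g. -5 .. +5).
lexicon = {}
with open('indonesian_sentiment_lexicon.tsv', encoding='utf8') as f:   # hypothetical file name
    for row in csv.DictReader(f, delimiter='\t'):
        lexicon[row['word']] = float(row['weight'])

def indonesian_polarity(text):
    score = sum(lexicon.get(token, 0.0) for token in text.lower().split())
    if score > 0:
        return "positif"
    elif score < 0:
        return "Negatif"
    return "Netral"

# e.g. tweets_properties["sentimen"] = indonesian_polarity(tweets_full_cleansing)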
I apologise in advance for posting so much code.
I am trying to classify YouTube comments into ones that contain opinion (be it positive or negative) and ones that don't, using NLTK's Naive Bayes classifier, but no matter what I do during the preprocessing stage I can't get the accuracy above 0.75. This seems quite low compared to other examples I have seen; this tutorial ends up with an accuracy of around 0.98, for example.
Here is my full code:
import nltk, re, json, random
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.tokenize import TweetTokenizer
from nltk import FreqDist, classify, NaiveBayesClassifier
from contractions import CONTRACTION_MAP
from abbreviations import abbrev_map
from tqdm.notebook import tqdm

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    text = re.sub(r"’", "'", text)
    if text in abbrev_map:
        return(abbrev_map[text])
    text = re.sub(r"\bluv", "lov", text)
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE|re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
            if contraction_mapping.get(match) \
            else contraction_mapping.get(match.lower())
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    return expanded_text

def reduce_lengthening(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

def processor(comments_list):
    new_comments_list = []
    for com in tqdm(comments_list):
        com = com.lower()
        # expand out contractions
        tok = com.split(" ")
        z = []
        for w in tok:
            ex_w = expand_contractions(w)
            z.append(ex_w)
        st = " ".join(z)
        tokenized = tokenizer.tokenize(st)
        reduced = [reduce_lengthening(token) for token in tokenized]
        new_comments_list.append(reduced)
    lemmatized = [lemmatize_sentence(new_com) for new_com in new_comments_list]
    return(lemmatized)

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_comments_for_model(cleaned_tokens_list):
    for comment_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in comment_tokens)

if __name__ == "__main__":
    #=================================================================================~
    tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

    with open("english_lang/samples/training_set.json", "r", encoding="utf8") as f:
        train_data = json.load(f)

    pos_processed = processor(train_data['pos'])
    neg_processed = processor(train_data['neg'])
    neu_processed = processor(train_data['neu'])

    emotion = pos_processed + neg_processed
    random.shuffle(emotion)

    em_tokens_for_model = get_comments_for_model(emotion)
    neu_tokens_for_model = get_comments_for_model(neu_processed)

    em_dataset = [(comment_dict, "Emotion")
                  for comment_dict in em_tokens_for_model]
    neu_dataset = [(comment_dict, "Neutral")
                   for comment_dict in neu_tokens_for_model]

    dataset = em_dataset + neu_dataset
    random.shuffle(dataset)

    x = 700
    tr_data = dataset[:x]
    te_data = dataset[x:]

    classifier = NaiveBayesClassifier.train(tr_data)
    print(classify.accuracy(classifier, te_data))
I can post my training data set if needed, but it's probably worth mentioning that the quality of English is very poor and inconsistent in the YouTube comments themselves (which I imagine is the reason for the low model accuracy). In any case, would this be considered an acceptable level of accuracy?
Alternatively, I may well be going about this all wrong and there is a far superior model I should be using, in which case feel free to tell me I am an idiot!
Thanks in advance
It is not statistically valid to compare your results against those of an unrelated tutorial. Before you panic, please do appropriate research on the factors that can reduce a model's accuracy. First and foremost, your model cannot exhibit an accuracy higher than that inherent in the data set's information. For instance, no model can perform (in the long run) better than 50% in predicting a random binary event, regardless of the data set.
We have no reasonable way to evaluate the theoretical information content of your data set here. If you need a check, try applying some other model types to the same data and see what accuracy they produce. Running these experiments is a normal part of data science.
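If you want a quick way to run that comparison, NLTK can wrap scikit-learn estimators, so you can train a couple of alternative classifiers on the same tr_data / te_data split from your script and compare accuracies. This assumes scikit-learn is installed and is only a sketch, not a tuned setup:

from nltk import classify
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# tr_data and te_data are the (featureset, label) lists built in the question's script
for name, estimator in [("Logistic regression", LogisticRegression(max_iter=1000)),
                        ("Linear SVM", LinearSVC())]:
    model = SklearnClassifier(estimator).train(tr_data)
    print(name, classify.accuracy(model, te_data))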
I'm using the spaCy tokenizer to tokenize my data and then build a vocabulary.
This is my code:
import spacy
nlp = spacy.load("en_core_web_sm")

def build_vocab(docs, max_vocab=10000, min_freq=3):
    stoi = {'<PAD>': 0, '<UNK>': 1}
    itos = {0: '<PAD>', 1: '<UNK>'}
    word_freq = {}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1
            if word_freq[word] == min_freq:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos
But it takes hours to complete since I have more than 800,000 sentences.
Is there a faster and better way to achieve this? Thanks.
Update: I tried removing min_freq:
def build_vocab(docs, max_vocab=10000):
    stoi = {'<PAD>': 0, '<UNK>': 1}
    itos = {0: '<PAD>', 1: '<UNK>'}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in stoi:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos
It still takes a long time. Does spaCy have a function to build a vocab like torchtext's .build_vocab?
There are a couple of things you can do to make this faster.
import spacy
from collections import Counter

def build_vocab(texts, max_vocab=10000, min_freq=3):
    nlp = spacy.blank("en")   # just the tokenizer
    wc = Counter()
    for doc in nlp.pipe(texts):
        for word in doc:
            wc[word.lower_] += 1

    word2id = {}
    id2word = {}
    for word, count in wc.most_common():
        if count < min_freq:
            break
        if len(word2id) >= max_vocab:
            break
        wid = len(word2id)
        word2id[word] = wid
        id2word[wid] = word
    return word2id, id2word
Explanation:
If you only use the tokenizer you can use spacy.blank
nlp.pipe is fast for lots of text (less important, maybe irrelevant with blank model though)
Counter is optimized for this kind of counting task
Another thing: the way you are building your vocab in your initial example, you take the first N words that reach the frequency threshold, not the N most frequent words, which is probably not what you want.
Also, if you're using spaCy you usually shouldn't build your vocab this way: spaCy has its own built-in Vocab class that handles converting tokens to IDs. You might need this mapping for a downstream task, but look at the Vocab docs to see if you can use that instead.
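As a quick illustration of that last point: spaCy already maps every token string to a stable hash ID through nlp.vocab.strings, so if all you need is an integer ID per token you may not have to build your own mapping at all. Note that these IDs are 64-bit hashes, not small contiguous indices, so whether this fits depends on your downstream model:

import spacy

nlp = spacy.blank("en")
doc = nlp("spaCy stores strings as hashes")

for token in doc:
    token_id = token.lower                                        # hash of the lowercased form
    print(token.lower_, token_id, nlp.vocab.strings[token_id])    # round-trips back to the text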
Just to preface, this code is from a great guy on GitHub/YouTube:
https://github.com/the-javapocalypse/
I made some minor tweaks for my personal use.
One thing that has always stood between me and sentiment analysis on Twitter is the fact that so many bot posts exist. I figure that if I cannot avoid the bots altogether, maybe I can at least remove duplicates to limit their impact.
For example - "#bitcoin" or "#btc" - Bot accounts exist under many different handles posting the same exact tweet. It could say "It's going to the moon! Buy now #btc or forever regret it! Buy, buy, buy! Here's a link to my personal site [insert personal site url here]"
This would seem like a positive sentiment post. If 25 accounts post this 2 times per account, we have some inflation if I am only analyzing the most recent 500 tweets containing "#btc".
So, to my questions:
1. What is an effective way to remove duplicates before writing to the CSV file? I was thinking of adding a simple if statement that checks an array to see whether the tweet already exists. There is an issue with this: say I request 1000 tweets to analyze; if 500 of them are duplicates from bots, my 1000 tweet analysis just became a 501 tweet analysis. This leads to my next question.
2. How can I check for duplicates and, each time one is found, add 1 to my total request for tweets to analyze? Example: I want to analyze 1000 tweets. One duplicate was found, so there are 999 unique tweets in the analysis. I want the script to fetch one more to make it 1000 unique tweets (1001 tweets including the duplicate).
3. A small change, but I think it would also be useful to know how to remove all tweets with embedded hyperlinks. This plays into the objective of question 2 by compensating for the dropped hyperlink tweets. Example: I want to analyze 1000 tweets; 500 of them have embedded URLs and are removed from the analysis, leaving me with 500. I still want 1000, so the script needs to keep fetching non-URL, non-duplicate tweets until 1000 unique, URL-free tweets have been accounted for.
See below for the entire script:
import tweepy
import csv
import re
from textblob import TextBlob
import matplotlib.pyplot as plt


class SentimentAnalysis:

    def __init__(self):
        self.tweets = []
        self.tweetText = []

    def DownloadData(self):
        # authenticating
        consumerKey = ''
        consumerSecret = ''
        accessToken = ''
        accessTokenSecret = ''
        auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
        auth.set_access_token(accessToken, accessTokenSecret)
        api = tweepy.API(auth)

        # input for term to be searched and how many tweets to search
        searchTerm = input("Enter Keyword/Tag to search about: ")
        NoOfTerms = int(input("Enter how many tweets to search: "))

        # searching for tweets
        self.tweets = tweepy.Cursor(api.search, q=searchTerm, lang="en").items(NoOfTerms)

        csvFile = open('result.csv', 'a')
        csvWriter = csv.writer(csvFile)

        # creating some variables to store info
        polarity = 0
        positive = 0
        negative = 0
        neutral = 0

        # iterating through tweets fetched
        for tweet in self.tweets:
            # Append to temp so that we can store in csv later. I use encode UTF-8
            self.tweetText.append(self.cleanTweet(tweet.text).encode('utf-8'))
            analysis = TextBlob(tweet.text)
            # print(analysis.sentiment)  # print tweet's polarity
            polarity += analysis.sentiment.polarity  # adding up polarities

            if (analysis.sentiment.polarity == 0):  # adding reaction
                neutral += 1
            elif (analysis.sentiment.polarity > 0.0):
                positive += 1
            else:
                negative += 1

        csvWriter.writerow(self.tweetText)
        csvFile.close()

        # finding average of how people are reacting
        positive = self.percentage(positive, NoOfTerms)
        negative = self.percentage(negative, NoOfTerms)
        neutral = self.percentage(neutral, NoOfTerms)

        # finding average reaction
        polarity = polarity / NoOfTerms

        # printing out data
        print("How people are reacting on " + searchTerm +
              " by analyzing " + str(NoOfTerms) + " tweets.")
        print()
        print("General Report: ")

        if (polarity == 0):
            print("Neutral")
        elif (polarity > 0.0):
            print("Positive")
        else:
            print("Negative")

        print()
        print("Detailed Report: ")
        print(str(positive) + "% positive")
        print(str(negative) + "% negative")
        print(str(neutral) + "% neutral")

        self.plotPieChart(positive, negative, neutral, searchTerm, NoOfTerms)

    def cleanTweet(self, tweet):
        # Remove Links, Special Characters etc from tweet
        return ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t]) | (\w +:\ / \ / \S +)", " ", tweet).split())

    # function to calculate percentage
    def percentage(self, part, whole):
        temp = 100 * float(part) / float(whole)
        return format(temp, '.2f')

    def plotPieChart(self, positive, negative, neutral, searchTerm, noOfSearchTerms):
        labels = ['Positive [' + str(positive) + '%]', 'Neutral [' + str(neutral) + '%]',
                  'Negative [' + str(negative) + '%]']
        sizes = [positive, neutral, negative]
        colors = ['yellowgreen', 'gold', 'red']
        patches, texts = plt.pie(sizes, colors=colors, startangle=90)
        plt.legend(patches, labels, loc="best")
        plt.title('How people are reacting on ' + searchTerm +
                  ' by analyzing ' + str(noOfSearchTerms) + ' Tweets.')
        plt.axis('equal')
        plt.tight_layout()
        plt.show()


if __name__ == "__main__":
    sa = SentimentAnalysis()
    sa.DownloadData()
Answer to your first question
You can remove duplicates with this one-liner:
self.tweets = list(set(self.tweets))
This will remove every duplicated tweet. In case you want to see it working, here is a simple example:
>>> tweets = ['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> print(tweets)
['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> tweets = list(set(tweets))
>>> print(tweets)
['this is a tweet', 'Yet another Tweet']
Answer to your second question
Now that you have removed the duplicates, you can get the number of tweets that were dropped by taking the difference between NoOfTerms and the length of self.tweets:
tweets_to_further_scrape = NoOfTerms - len(self.tweets)
Now you can scrape tweets_to_further_scrape more tweets and repeat this process of deduplicating and scraping until you have found the desired number of unique tweets.
Answer to your third question
When iterating over the tweets list, add this line to strip external links from the tweet text:
tweet.text = ' '.join([i for i in tweet.text.split() if 'http' not in i])
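Putting the three pieces together, one way to keep fetching until you have the desired number of unique, link-free tweets is a helper like the one below. fetch_unique_tweets is my own name, not part of the original script, and it reuses the same api.search cursor the script above uses (adjust the call if your tweepy version differs):

def fetch_unique_tweets(api, search_term, wanted):
    # Sketch only: keep pulling tweets until `wanted` unique, link-free texts are collected.
    unique_texts = set()
    for tweet in tweepy.Cursor(api.search, q=search_term, lang="en").items():
        if 'http' in tweet.text:          # drop tweets that contain links entirely
            continue
        unique_texts.add(tweet.text)      # a set silently ignores duplicates
        if len(unique_texts) >= wanted:
            break                         # enough unique tweets collected
    return list(unique_texts)

# inside DownloadData, instead of the plain tweepy.Cursor call:
# self.tweets = fetch_unique_tweets(api, searchTerm, NoOfTerms)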
Hope this will help you out. Happy coding!
You could simply keep a running count of the tweet instances using a defaultdict. You may want to remove the web addresses as well, in case they are blasting out new shortened URLs.
from collections import defaultdict

def __init__(self):
    ...
    self.tweet_count = defaultdict(int)

def track_tweet(self, tweet):
    t = self.clean_tweet(tweet)
    self.tweet_count[t] += 1

def clean_tweet(self, tweet):
    t = tweet.lower()
    # any other tweet normalization happens here, such as dropping URLs
    return t

def DownloadData(self):
    ...
    for tweet in self.tweets:
        ...
        # add logic to check for number of repeats in the dictionary.
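For instance, inside DownloadData the repeat check could look roughly like this (a sketch reusing the names from the snippet above; only the first copy of each tweet text reaches the scoring step):

for tweet in self.tweets:
    key = self.clean_tweet(tweet.text)
    self.tweet_count[key] += 1            # same update track_tweet performs
    if self.tweet_count[key] > 1:
        continue                          # duplicate text, skip it
    analysis = TextBlob(tweet.text)
    # ... the rest of the per-tweet scoring stays the same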
I'm new to Python and need help!
I was practicing with Python NLTK text classification.
Here is the code example I am practicing on:
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
I've tried this:
from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict

train_samples = {}

with file('positive.txt', 'rt') as f:
    for line in f.readlines():
        train_samples[line] = 'pos'

with file('negative.txt', 'rt') as d:
    for line in d.readlines():
        train_samples[line] = 'neg'

f = open("test.txt", "r")
test_samples = f.readlines()

def bigramReturner(text):
    tweetString = text.lower()
    bigramFeatureVector = {}
    for item in bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

def get_labeled_features(samples):
    word_freqs = {}
    for text, label in train_samples.items():
        tokens = text.split()
        for token in tokens:
            if token not in word_freqs:
                word_freqs[token] = {'pos': 0, 'neg': 0}
            word_freqs[token][label] += 1
    return word_freqs

def get_label_probdist(labeled_features):
    label_fd = FreqDist()
    for item, counts in labeled_features.items():
        for label in ['neg', 'pos']:
            if counts[label] > 0:
                label_fd.inc(label)
    label_probdist = ELEProbDist(label_fd)
    return label_probdist

def get_feature_probdist(labeled_features):
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    num_samples = len(train_samples) / 2
    for token, counts in labeled_features.items():
        for label in ['neg', 'pos']:
            feature_freqdist[label, token].inc(True, count=counts[label])
            feature_freqdist[label, token].inc(None, num_samples - counts[label])
            feature_values[token].add(None)
            feature_values[token].add(True)
    for item in feature_freqdist.items():
        print item[0], item[1]
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label, fname] = probdist
    return feature_probdist

labeled_features = get_labeled_features(train_samples)
label_probdist = get_label_probdist(labeled_features)
feature_probdist = get_feature_probdist(labeled_features)

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

for sample in test_samples:
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
but I am getting this error. Why?
Traceback (most recent call last):
File "C:\python\naive_test.py", line 76, in <module>
print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
File "C:\python\naive_test.py", line 23, in bigramReturner
bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'
A bigram feature vector follows exactly the same principles as a unigram feature vector. So, just like in the tutorial you mentioned, you will have to check whether a bigram feature is present in any of the documents you use.
As for the bigram features and how to extract them, I have written the code below for it. You can simply adapt it to change the variable "tweets" in the tutorial.
import nltk

text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams(text.split()):
    print ' '.join(item)
Instead of printing them you can simply append them to the "tweets" list and you are good to go! I hope this is helpful enough; otherwise, let me know if you still have problems.
Please note that in applications like sentiment analysis some researchers tokenize the words and remove the punctuation while others don't. From experience I know that if you don't remove punctuation, Naive Bayes works about the same, but an SVM would have a decreased accuracy rate. You might need to play around with this and decide what works better on your dataset.
Edit 1:
There is a book named "Natural Language Processing with Python" which I can recommend to you. It contains examples of bigrams as well as some exercises. However, I think you can solve this case even without it. The idea behind selecting bigrams as features is that we want to know the probability that word A appears in our corpus followed by word B. So, for example, in the sentence
"I drive a truck"
the word unigram features would be each of those 4 words while the word bigram features would be:
["I drive", "drive a", "a truck"]
Now you want to use those 3 as your features. So the function below puts all the bigrams of a string into a list named bigramFeatureVector.
def bigramReturner(tweetString):
    tweetString = tweetString.lower()
    tweetString = removePunctuation(tweetString)
    bigramFeatureVector = []
    for item in nltk.bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector
Note that you have to write your own removePunctuation function. What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.
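In case it helps, a very simple removePunctuation and the final step of turning the bigram list into the {feature: True} dictionary that NaiveBayesClassifier.classify expects could look like this (a sketch; bigramFeatures is my own helper name, mirroring what the tutorial does with unigrams):

import re
import string

def removePunctuation(text):
    # drop every punctuation character
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def bigramFeatures(tweetString):
    # NaiveBayesClassifier expects a featureset dict such as {"drive a": True, ...}
    return dict((bigram, True) for bigram in bigramReturner(tweetString))

# then classify with: classifier.classify(bigramFeatures(sample))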