I'm doing sentiment analysis on tweets in Python to judge whether each tweet is negative, positive, or neutral, with this code:
import re
from textblob import TextBlob

hasilAnalisis = []
for tweets in hasilUser:
    tweets_properties = {}
    tweets_properties['tanggal_tweet'] = tweets.created_at
    tweets_properties['pengguna'] = tweets.user.screen_name
    tweets_properties['isi_tweet'] = tweets.text

    # strip hashtags, non-alphanumeric characters and URLs before analysis
    tweets_full_cleansing = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweets.text).split())
    analysis = TextBlob(tweets_full_cleansing)
    try:
        analysis = analysis.translate(to='en')
    except Exception as e:
        print(e)

    if analysis.sentiment.polarity > 0.0:
        tweets_properties["sentimen"] = "positif"
    elif analysis.sentiment.polarity == 0.0:
        tweets_properties["sentimen"] = "Netral"
    else:
        tweets_properties["sentimen"] = "Negatif"

    # keep retweeted tweets only once; append everything else
    if tweets.retweet_count > 0:
        if tweets_properties not in hasilAnalisis:
            hasilAnalisis.append(tweets_properties)
    else:
        hasilAnalisis.append(tweets_properties)
The problem is that the tweets I want to analyze are in Indonesian, so I have to translate them into English first using this code:
try:
    analysis = analysis.translate(to='en')
except Exception as e:
    print(e)
After that the tweet can be scored, because it has been converted to English and analysis.sentiment.polarity works.
Is there any way to do this without translating the text into English first?
Based on this article, https://ksnugroho.medium.com/dasar-text-preprocessing-dengan-python-a4fa52608ffe,
I can use Sastrawi for tokenization, but I have no idea how to compute sentiment polarity for Indonesian text.
I looked into your question and found a method for this. You can use:
a Sentiment Lexicon for Indonesian, in this GitHub link.
A sentiment lexicon usually exists for every language.
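As a rough sketch of how such a lexicon could be plugged into your loop (the file name lexicon_id.csv and its word,weight format are assumptions on my part, not the actual layout of that repository):

import csv

# Hypothetical lexicon file: one "word,weight" pair per line, e.g. "bagus,5".
def load_lexicon(path):
    lexicon = {}
    with open(path, encoding='utf-8') as f:
        for word, weight in csv.reader(f):
            lexicon[word] = int(weight)
    return lexicon

def score_text(text, lexicon):
    # Sum the weights of known words; the sign of the total gives the label.
    total = sum(lexicon.get(token, 0) for token in text.lower().split())
    if total > 0:
        return "positif"
    elif total < 0:
        return "Negatif"
    return "Netral"

lexicon = load_lexicon("lexicon_id.csv")
tweets_properties["sentimen"] = score_text(tweets_full_cleansing, lexicon)

This way the translate() call and its network round trip can be dropped entirely.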
I apologise in advance for posting so much code.
I am trying to classify YouTube comments into ones that contain opinion (be it positive or negative) and ones that don't using NLTK's Naive Bayes classifier, but no matter what I do during the preprocessing stage I can't really get the accuracy above 0.75. This seems kinda low compared to other examples I have seen - this tutorial ends up with an accuracy of around 0.98 for example.
Here is my full code
import nltk, re, json, random
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.tokenize import TweetTokenizer
from nltk import FreqDist, classify, NaiveBayesClassifier
from contractions import CONTRACTION_MAP
from abbreviations import abbrev_map
from tqdm.notebook import tqdm

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    text = re.sub(r"’", "'", text)
    if text in abbrev_map:
        return(abbrev_map[text])
    text = re.sub(r"\bluv", "lov", text)
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
            if contraction_mapping.get(match)\
            else contraction_mapping.get(match.lower())
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction
    expanded_text = contractions_pattern.sub(expand_match, text)
    return expanded_text

def reduce_lengthening(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

def processor(comments_list):
    new_comments_list = []
    for com in tqdm(comments_list):
        com = com.lower()
        # expand out contractions
        tok = com.split(" ")
        z = []
        for w in tok:
            ex_w = expand_contractions(w)
            z.append(ex_w)
        st = " ".join(z)
        tokenized = tokenizer.tokenize(st)
        reduced = [reduce_lengthening(token) for token in tokenized]
        new_comments_list.append(reduced)
    lemmatized = [lemmatize_sentence(new_com) for new_com in new_comments_list]
    return(lemmatized)

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_comments_for_model(cleaned_tokens_list):
    for comment_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in comment_tokens)

if __name__ == "__main__":
    #=================================================================================~
    tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
    with open("english_lang/samples/training_set.json", "r", encoding="utf8") as f:
        train_data = json.load(f)
    pos_processed = processor(train_data['pos'])
    neg_processed = processor(train_data['neg'])
    neu_processed = processor(train_data['neu'])
    emotion = pos_processed + neg_processed
    random.shuffle(emotion)
    em_tokens_for_model = get_comments_for_model(emotion)
    neu_tokens_for_model = get_comments_for_model(neu_processed)
    em_dataset = [(comment_dict, "Emotion")
                  for comment_dict in em_tokens_for_model]
    neu_dataset = [(comment_dict, "Neutral")
                   for comment_dict in neu_tokens_for_model]
    dataset = em_dataset + neu_dataset
    random.shuffle(dataset)
    x = 700
    tr_data = dataset[:x]
    te_data = dataset[x:]
    classifier = NaiveBayesClassifier.train(tr_data)
    print(classify.accuracy(classifier, te_data))
I can post my training data set if needed, but it's probably worth mentioning that the quality of English is very poor and inconsistent in the YouTube comments themselves (which I imagine is the reason for the low model accuracy). In any case, would this be considered an acceptable level of accuracy?
Alternatively, I may well be going about this all wrong and there is a far superior model to be using, in which case feel free to tell me I am an idiot!
Thanks in advance
It is not statistically valid to compare your results against those of an unrelated tutorial. Before you panic, please do appropriate research on the factors that can reduce a model's accuracy. First and foremost, your model cannot exhibit an accuracy higher than that inherent in the data set's information. For instance, no model can perform (in the long run) better than 50% in predicting a random binary event, regardless of the data set.
We have no reasonable way to evaluate the theoretical information content. If you need a check, try applying some other model types to the same data, and see what they produce for accuracy. Running these experiments is a normal part of data science.
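If you want a quick sanity check of that kind, here is a minimal sketch of a baseline comparison using scikit-learn (my assumption about tooling, not a drop-in replacement for your NLTK pipeline; texts and labels are the raw comments and their "Emotion"/"Neutral" tags):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def compare_models(texts, labels):
    # Shared TF-IDF features for all three baseline models.
    X = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit_transform(texts)
    for name, model in [("multinomial naive bayes", MultinomialNB()),
                        ("logistic regression", LogisticRegression(max_iter=1000)),
                        ("linear SVM", LinearSVC())]:
        scores = cross_val_score(model, X, labels, cv=5)
        print(name, "mean accuracy:", round(scores.mean(), 3))

If all of them also land around 0.75, the ceiling is probably in the data rather than in your preprocessing.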
Just to preface, this code is from a great guy on Github / Youtube:
https://github.com/the-javapocalypse/
I made some minor tweaks for my personal use.
One thing that has always stood between me and sentiment analysis on Twitter is the fact that so many bot posts exist. I figure that if I cannot avoid the bots altogether, maybe I can just remove duplication to hedge the impact.
For example - "#bitcoin" or "#btc" - Bot accounts exist under many different handles posting the same exact tweet. It could say "It's going to the moon! Buy now #btc or forever regret it! Buy, buy, buy! Here's a link to my personal site [insert personal site url here]"
This would seem like a positive-sentiment post. If 25 accounts post this 2 times per account, we have some inflation if I am only analyzing the most recent 500 tweets containing "#btc".
So to my question:
What is an effective way to remove duplication before writing to the csv file? I was thinking of adding a simple if statement that checks an array to see whether the tweet already exists. There is an issue with this. Say I request 1000 tweets to analyze. If 500 of these are duplicates from bots, my 1000-tweet analysis just became a 501-tweet analysis. This leads to my next question.
What is a way to check for duplication and, each time a duplicate is found, add 1 to my total request for tweets to analyze? Example - I want to analyze 1000 tweets. Duplication was found one time, so there are 999 unique tweets to include in the analysis. I want the script to analyze one more to make it 1000 unique tweets (1001 tweets including the 1 duplicate).
Small change, but I think it would be effective to know how to remove all tweets with hyperlinks embedded. This would play into the objective of question 2 by compensating for dropping hyperlink tweets. Example - I want to analyze 1000 tweets. 500 of the 1000 have embedded URLs. The 500 are removed from the analysis. I am now down to 500 tweets. I still want 1000. Script needs to keep fetching non URL, non duplicates until 1000 unique non URL tweets have been accounted for.
See below for the entire script:
import tweepy
import csv
import re
from textblob import TextBlob
import matplotlib.pyplot as plt


class SentimentAnalysis:

    def __init__(self):
        self.tweets = []
        self.tweetText = []

    def DownloadData(self):
        # authenticating
        consumerKey = ''
        consumerSecret = ''
        accessToken = ''
        accessTokenSecret = ''
        auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
        auth.set_access_token(accessToken, accessTokenSecret)
        api = tweepy.API(auth)

        # input for term to be searched and how many tweets to search
        searchTerm = input("Enter Keyword/Tag to search about: ")
        NoOfTerms = int(input("Enter how many tweets to search: "))

        # searching for tweets
        self.tweets = tweepy.Cursor(api.search, q=searchTerm, lang="en").items(NoOfTerms)
        csvFile = open('result.csv', 'a')
        csvWriter = csv.writer(csvFile)

        # creating some variables to store info
        polarity = 0
        positive = 0
        negative = 0
        neutral = 0

        # iterating through tweets fetched
        for tweet in self.tweets:
            # Append to temp so that we can store in csv later. I use encode UTF-8
            self.tweetText.append(self.cleanTweet(tweet.text).encode('utf-8'))
            analysis = TextBlob(tweet.text)
            # print(analysis.sentiment)  # print tweet's polarity
            polarity += analysis.sentiment.polarity  # adding up polarities

            if (analysis.sentiment.polarity == 0):  # adding reaction
                neutral += 1
            elif (analysis.sentiment.polarity > 0.0):
                positive += 1
            else:
                negative += 1

        csvWriter.writerow(self.tweetText)
        csvFile.close()

        # finding average of how people are reacting
        positive = self.percentage(positive, NoOfTerms)
        negative = self.percentage(negative, NoOfTerms)
        neutral = self.percentage(neutral, NoOfTerms)

        # finding average reaction
        polarity = polarity / NoOfTerms

        # printing out data
        print("How people are reacting on " + searchTerm +
              " by analyzing " + str(NoOfTerms) + " tweets.")
        print()
        print("General Report: ")
        if (polarity == 0):
            print("Neutral")
        elif (polarity > 0.0):
            print("Positive")
        else:
            print("Negative")
        print()
        print("Detailed Report: ")
        print(str(positive) + "% positive")
        print(str(negative) + "% negative")
        print(str(neutral) + "% neutral")
        self.plotPieChart(positive, negative, neutral, searchTerm, NoOfTerms)

    def cleanTweet(self, tweet):
        # Remove links, special characters etc. from the tweet
        return ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    # function to calculate percentage
    def percentage(self, part, whole):
        temp = 100 * float(part) / float(whole)
        return format(temp, '.2f')

    def plotPieChart(self, positive, negative, neutral, searchTerm, noOfSearchTerms):
        labels = ['Positive [' + str(positive) + '%]', 'Neutral [' + str(neutral) + '%]',
                  'Negative [' + str(negative) + '%]']
        sizes = [positive, neutral, negative]
        colors = ['yellowgreen', 'gold', 'red']
        patches, texts = plt.pie(sizes, colors=colors, startangle=90)
        plt.legend(patches, labels, loc="best")
        plt.title('How people are reacting on ' + searchTerm +
                  ' by analyzing ' + str(noOfSearchTerms) + ' Tweets.')
        plt.axis('equal')
        plt.tight_layout()
        plt.show()


if __name__ == "__main__":
    sa = SentimentAnalysis()
    sa.DownloadData()
Answer to your first question
You can remove duplicates by using this one-liner:
self.tweets = list(set(self.tweets))
This will remove every duplicated tweet. In case you want to see it working, here is a simple example:
>>> tweets = ['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> print(tweets)
['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> tweets = list(set(tweets))
>>> print(tweets)
['this is a tweet', 'Yet another Tweet']
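One caveat (an assumption about the data on my part, since self.tweets holds tweepy status objects rather than plain strings): you may need to deduplicate on the tweet text instead, for example:

# Deduplicate on the cleaned text rather than on the status objects themselves,
# since two different status objects with identical text will not compare equal.
seen = set()
unique_tweets = []
for tweet in self.tweets:
    text = self.cleanTweet(tweet.text)
    if text not in seen:
        seen.add(text)
        unique_tweets.append(tweet)
self.tweets = unique_tweets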
Answer to your second question
Since you have now removed the duplicates, you can get the number of tweets still needed by taking the difference between NoOfTerms and the length of self.tweets:
tweets_to_further_scrape = NoOfTerms - len(self.tweets)
Now you can scrape tweets_to_further_scrape number of tweets and repeat this process of removing duplication and scraping until you have found desired number of unique tweets.
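A rough sketch of that loop (fetch_tweets here is a hypothetical helper standing in for whatever API call you use to pull a batch of tweet texts):

def collect_unique_tweets(fetch_tweets, target_count, batch_size=100):
    # fetch_tweets(n) is assumed to return up to n raw tweet texts per call.
    unique = set()
    while len(unique) < target_count:
        batch = fetch_tweets(batch_size)
        if not batch:
            break  # nothing more to fetch; avoid looping forever
        unique.update(batch)
    return list(unique)[:target_count]

In practice you would also want to respect the API rate limits inside that loop.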
Answer to your third question
When iterating over the tweets list, add this line to remove external links:
tweet.text = ' '.join([i for i in tweet.text.split() if 'http' not in i])
Hope this will help you out. Happy coding!
You could simply keep a running count of the tweet instances using a defaultdict. You may want to remove the web addresses as well, in case they are blasting out new shortened URLs.
from collections import defaultdict

def __init__(self):
    ...
    self.tweet_count = defaultdict(int)

def track_tweet(self, tweet):
    t = self.clean_tweet(tweet)
    self.tweet_count[t] += 1

def clean_tweet(self, tweet):
    t = tweet.lower()
    # any other tweet normalization happens here, such as dropping URLs
    return t

def DownloadData(self):
    ...
    for tweet in self.tweets:
        ...
        # add logic to check for number of repeats in the dictionary.
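For instance, that repeat check might look something like this (a sketch built on the methods above, skipping any tweet whose cleaned text has already been counted):

for tweet in self.tweets:
    cleaned = self.clean_tweet(tweet.text)
    if self.tweet_count[cleaned] > 0:
        # duplicate of a tweet we have already analyzed; count it and move on
        self.tweet_count[cleaned] += 1
        continue
    self.track_tweet(tweet.text)
    # ... run the TextBlob analysis on this first occurrence only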
I am doing sentiment analysis on Twitter data using Python NLTK. I need a dictionary which contains positive and negative polarities of words. I have read a lot about SentiWordNet, but when I use it in my project it does not give efficient or fast results. I think I'm not using it correctly. Can anyone tell me the correct way to use it? Here are the steps I have done so far:
tokenization of tweets
POS tagging of tokens
passing each tags to sentinet
I am using the nltk package for tokenization and tagging. See a part of my code below:
import nltk
from nltk.stem import *
from nltk.corpus import sentiwordnet as swn

tokens = nltk.word_tokenize(row)   # for tokenization; row is a line of the file in which tweets are saved
tagged = nltk.pos_tag(tokens)      # for POS tagging

for i in range(0, len(tagged)):
    if 'NN' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'n')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'n'))[0]).pos_score()  # positive score of a word
        nscore += (list(swn.senti_synsets(tagged[i][0], 'n'))[0]).neg_score()  # negative score of a word
    elif 'VB' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'v')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'v'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'v'))[0]).neg_score()
    elif 'JJ' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'a')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'a'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'a'))[0]).neg_score()
    elif 'RB' in tagged[i][1] and len(swn.senti_synsets(tagged[i][0], 'r')) > 0:
        pscore += (list(swn.senti_synsets(tagged[i][0], 'r'))[0]).pos_score()
        nscore += (list(swn.senti_synsets(tagged[i][0], 'r'))[0]).neg_score()
At the end I will be calculating how many tweets are positive and how many tweets are negative.
Where am I wrong? How should I use it? And is there any other similar kind of dictionary which is easy to use?
Yes, there are other lexicons that you can use. You can find a small list of lexicons here: http://sentiment.christopherpotts.net/lexicons.html#resources
It seems Bing Liu's Opinion Lexicon is quite easy to use.
Apart from linking to those lexicons, that website is a very nice tutorial on sentiment analysis.
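As a rough illustration of how that lexicon can be used (assuming the standard positive-words.txt / negative-words.txt files it ships with; treat this as a sketch rather than a reference implementation):

def load_liu_lexicon(path):
    # Lines starting with ';' are comments in the Bing Liu lexicon files.
    with open(path, encoding='latin-1') as f:
        return {line.strip() for line in f if line.strip() and not line.startswith(';')}

positive_words = load_liu_lexicon('positive-words.txt')
negative_words = load_liu_lexicon('negative-words.txt')

def liu_score(tokens):
    # Positive total -> positive tweet, negative total -> negative, zero -> neutral.
    return sum((tok in positive_words) - (tok in negative_words) for tok in tokens)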
To calculate the sentiment:
alist = [all_tokens_in_doc]
totalScore = 0
count_words_included = 0
for word in all_words_in_comment:
    synset_forms = list(swn.senti_synsets(word[0], word[1]))
    if not synset_forms:
        continue
    synset = synset_forms[0]
    totalScore = totalScore + synset.pos_score() - synset.neg_score()
    count_words_included = count_words_included + 1
final_dec = ''
if count_words_included == 0:
    final_dec = 'N/A'
elif totalScore == 0:
    final_dec = 'Neu'
elif totalScore / count_words_included < 0:
    final_dec = 'Neg'
elif totalScore / count_words_included > 0:
    final_dec = 'Pos'
return final_dec
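One possible way to turn that snippet into a per-tweet function (my sketch: it maps NLTK's Penn Treebank tags to the single-letter WordNet POS codes that senti_synsets expects):

from nltk.corpus import sentiwordnet as swn

PENN_TO_WN = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}

def classify_tweet(tagged_tokens):
    # tagged_tokens is the output of nltk.pos_tag(), i.e. (token, tag) pairs.
    total, counted = 0.0, 0
    for token, tag in tagged_tokens:
        wn_pos = PENN_TO_WN.get(tag[:2])
        if wn_pos is None:
            continue
        synsets = list(swn.senti_synsets(token, wn_pos))
        if not synsets:
            continue
        total += synsets[0].pos_score() - synsets[0].neg_score()
        counted += 1
    if counted == 0:
        return 'N/A'
    return 'Pos' if total > 0 else 'Neg' if total < 0 else 'Neu'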
Objective: To classify each tweet as positive or negative and write it to an output file which will contain the username, original tweet and the sentiment of the tweet.
Code:
import re, math

input_file = "raw_data.csv"
fileout = open("Output.txt", "w")
wordFile = open("words.txt", "w")
expression = r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
fileAFINN = 'AFINN-111.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [ws.strip().split('\t') for ws in open(fileAFINN)]))
pattern = re.compile(r'\w+')
pattern_split = re.compile(r"\W+")
words = pattern_split.split(input_file.lower())

print "File processing started"
with open(input_file, 'r') as myfile:
    for line in myfile:
        line = line.lower()
        line = re.sub(expression, " ", line)
        words = pattern_split.split(line.lower())
        sentiments = map(lambda word: afinn.get(word, 0), words)
        # print sentiments
        # How should you weight the individual word sentiments?
        # You could do N, sqrt(N) or 1 for example. Here I use sqrt(N)
        """
        Returns a float for sentiment strength based on the input text.
        Positive values are positive valence, negative values are negative valence.
        """
        if sentiments:
            sentiment = float(sum(sentiments)) / math.sqrt(len(sentiments))
            # wordFile.write(sentiments)
        else:
            sentiment = 0
        wordFile.write(line + ',' + str(sentiment) + '\n')
        fileout.write(line + '\n')

print "File processing completed"
fileout.close()
myfile.close()
wordFile.close()
Issue: Currently the Output.txt file looks like this:
abc some tweet text 0
bcd some more tweets 1
efg some more tweet 0
Question 1: How do I add commas between the user id, the tweet text, and the sentiment? The output should look like:
abc,some tweet text,0
bcd,some other tweet,1
efg,more tweets,0
Question 2: The tweets are in Bahasa Melayu (BM) and the AFINN dictionary that I am using contains English words, so the classification is wrong. Do you know of any BM dictionary that I can use?
Question 3: How do I pack this code in a JAR file?
Thank you.
Question 1:
Output.txt is currently composed of exactly the lines you are reading in, because of fileout.write(line+'\n'). Since each line is space separated, you can split it pretty easily:
line_data = line.split(' ') # Split the line into a list, separated by spaces
user_id = line_data[0] # The first element of the list
tweets = line_data[1:-1] # The middle elements of the list
sentiment = line_data[-1] # The last element of the list
fileout.write(user_id + "," + " ".join(tweets) + "," + sentiment +'\n')
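If the tweet text could ever contain commas itself, a slightly more robust variant of the same idea (my suggestion, not part of the original script) is to let the csv module handle the quoting:

import csv

csv_out = csv.writer(fileout)
# inside the loop, instead of fileout.write(line + '\n'):
line_data = line.split(' ')
csv_out.writerow([line_data[0], ' '.join(line_data[1:-1]), line_data[-1]])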
Question 2:
A quick google search gave me this. Not sure if it has everything you will need though: https://archive.org/stream/grammardictionar02craw/grammardictionar02craw_djvu.txt
Question 3:
Try Jython http://www.jython.org/archive/21/docs/jythonc.html
I want to get the 'ing' form of a verb.
Currently I am using the method below, which depends on the NodeBox English Linguistics library, and my code fails in most cases.
from libs.en import *

def get_continuous_tense(i_verb):
    i_verb = verb.infinitive(i_verb)  # make sure that the verb is in infinitive form
    temp = i_verb + 'ing'
    if verb.infinitive(temp) == i_verb:
        return temp
    temp = i_verb + i_verb[-1:] + 'ing'
    if verb.infinitive(temp) == i_verb:
        return temp
    # ......... continues like this

print get_continuous_tense('played')
verb.present_participle(word)
This is functionality that comes with NodeBox English Linguistics, documented right in the link you gave.
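For example (reusing the same libs.en import as in the question; the expected outputs are my reading of the library's docs, so double-check them yourself):

from libs.en import *

print(verb.present_participle('play'))    # expected: 'playing'
print(verb.present_participle('swim'))    # expected: 'swimming'
print(verb.present_participle(verb.infinitive('played')))  # expected: 'playing'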