Twitter Sentiment Analysis - Remove bot duplication for more accurate results - python

Just to preface, this code is from a great guy on Github / Youtube:
https://github.com/the-javapocalypse/
I made some minor tweaks for my personal use.
One thing that has always stood between me and sentiment analysis on Twitter is the fact that so many bot posts exist. I figure that if I cannot avoid the bots altogether, maybe I can at least remove duplication to hedge their impact.
For example - "#bitcoin" or "#btc" - Bot accounts exist under many different handles posting the same exact tweet. It could say "It's going to the moon! Buy now #btc or forever regret it! Buy, buy, buy! Here's a link to my personal site [insert personal site url here]"
This would seem like a positive sentiment post. If 25 accounts post this 2 times per account, we have some inflation if I am only analyzing the recent 500 tweets containing "#btc"
So, on to my questions:
What is an effective way to remove duplication before writing to the CSV file? I was thinking of adding a simple if statement that checks an array to see if the tweet already exists. There is an issue with this, though. Say I request 1000 tweets to analyze. If 500 of these are duplicates from bots, my 1000-tweet analysis just became a 501-tweet analysis. This leads to my next question.
What is a way to check for duplication and, each time a duplicate is found, add 1 to my total request for tweets to analyze? Example - I want to analyze 1000 tweets. Duplication was found once, so there are 999 unique tweets to include in the analysis. I want the script to analyze one more to make it 1000 unique tweets (1001 tweets including the 1 duplicate).
A small change, but I think it would be useful to know how to remove all tweets with embedded hyperlinks. This plays into the objective of question 2 by compensating for the dropped hyperlink tweets. Example - I want to analyze 1000 tweets. 500 of the 1000 have embedded URLs, so those 500 are removed from the analysis and I am down to 500 tweets. I still want 1000, so the script needs to keep fetching non-URL, non-duplicate tweets until 1000 unique, URL-free tweets have been accounted for.
See below for the entire script:
import tweepy
import csv
import re
from textblob import TextBlob
import matplotlib.pyplot as plt


class SentimentAnalysis:

    def __init__(self):
        self.tweets = []
        self.tweetText = []

    def DownloadData(self):
        # authenticating
        consumerKey = ''
        consumerSecret = ''
        accessToken = ''
        accessTokenSecret = ''
        auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
        auth.set_access_token(accessToken, accessTokenSecret)
        api = tweepy.API(auth)

        # input for term to be searched and how many tweets to search
        searchTerm = input("Enter Keyword/Tag to search about: ")
        NoOfTerms = int(input("Enter how many tweets to search: "))

        # searching for tweets
        self.tweets = tweepy.Cursor(api.search, q=searchTerm, lang="en").items(NoOfTerms)

        csvFile = open('result.csv', 'a')
        csvWriter = csv.writer(csvFile)

        # creating some variables to store info
        polarity = 0
        positive = 0
        negative = 0
        neutral = 0

        # iterating through tweets fetched
        for tweet in self.tweets:
            # Append to temp so that we can store in csv later. I use encode UTF-8
            self.tweetText.append(self.cleanTweet(tweet.text).encode('utf-8'))
            analysis = TextBlob(tweet.text)
            # print(analysis.sentiment)  # print tweet's polarity
            polarity += analysis.sentiment.polarity  # adding up polarities

            if (analysis.sentiment.polarity == 0):  # adding reaction
                neutral += 1
            elif (analysis.sentiment.polarity > 0.0):
                positive += 1
            else:
                negative += 1

        csvWriter.writerow(self.tweetText)
        csvFile.close()

        # finding average of how people are reacting
        positive = self.percentage(positive, NoOfTerms)
        negative = self.percentage(negative, NoOfTerms)
        neutral = self.percentage(neutral, NoOfTerms)

        # finding average reaction
        polarity = polarity / NoOfTerms

        # printing out data
        print("How people are reacting on " + searchTerm +
              " by analyzing " + str(NoOfTerms) + " tweets.")
        print()
        print("General Report: ")

        if (polarity == 0):
            print("Neutral")
        elif (polarity > 0.0):
            print("Positive")
        else:
            print("Negative")

        print()
        print("Detailed Report: ")
        print(str(positive) + "% positive")
        print(str(negative) + "% negative")
        print(str(neutral) + "% neutral")

        self.plotPieChart(positive, negative, neutral, searchTerm, NoOfTerms)

    def cleanTweet(self, tweet):
        # Remove Links, Special Characters etc from tweet
        return ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    # function to calculate percentage
    def percentage(self, part, whole):
        temp = 100 * float(part) / float(whole)
        return format(temp, '.2f')

    def plotPieChart(self, positive, negative, neutral, searchTerm, noOfSearchTerms):
        labels = ['Positive [' + str(positive) + '%]', 'Neutral [' + str(neutral) + '%]',
                  'Negative [' + str(negative) + '%]']
        sizes = [positive, neutral, negative]
        colors = ['yellowgreen', 'gold', 'red']
        patches, texts = plt.pie(sizes, colors=colors, startangle=90)
        plt.legend(patches, labels, loc="best")
        plt.title('How people are reacting on ' + searchTerm +
                  ' by analyzing ' + str(noOfSearchTerms) + ' Tweets.')
        plt.axis('equal')
        plt.tight_layout()
        plt.show()


if __name__ == "__main__":
    sa = SentimentAnalysis()
    sa.DownloadData()

Answer to your first question
You can remove duplicates with this one-liner:
self.tweets = list(set(self.tweets))
This will remove every duplicated tweet. In case you want to see it working, here is a simple example:
>>> tweets = ['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> print(tweets)
['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> tweets = list(set(tweets))
>>> print(tweets)
['this is a tweet', 'Yet another Tweet']
Answer to your second question
Now that you have removed duplicates, you can get the number of tweets you still need by taking the difference between NoOfTerms and the length of self.tweets:
tweets_to_further_scrape = NoOfTerms - len(self.tweets)
Now you can scrape tweets_to_further_scrape more tweets and repeat this process of removing duplicates and scraping until you have found the desired number of unique tweets.
Answer to your third question
When iterating through the tweets list, add this line to strip external links from the tweet text:
tweet.text = ' '.join([i for i in tweet.text.split() if 'http' not in i])
Hope this will help you out. Happy coding!
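Putting the three answers together, here is a rough sketch of how the fetching step itself could keep going until it has enough unique, link-free tweets. It reuses the same tweepy Cursor and api.search call as the script above; fetch_unique_tweets is a made-up helper name, not part of the original code.

def fetch_unique_tweets(api, searchTerm, NoOfTerms):
    # keep pulling tweets until NoOfTerms unique, link-free tweets are collected
    seen = set()
    unique_tweets = []
    # with no argument, items() keeps paging through results until the search
    # is exhausted or the rate limit is hit, so we stop it ourselves
    for tweet in tweepy.Cursor(api.search, q=searchTerm, lang="en").items():
        text = tweet.text
        if 'http' in text:      # skip tweets with embedded links
            continue
        if text in seen:        # skip exact duplicates (likely bots)
            continue
        seen.add(text)
        unique_tweets.append(tweet)
        if len(unique_tweets) == NoOfTerms:
            break
    return unique_tweets

Inside DownloadData() the existing self.tweets assignment could then become self.tweets = fetch_unique_tweets(api, searchTerm, NoOfTerms).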

You could simply keep a running count of the tweet instances using a defaultdict. You may want to remove the web addresses as well, in case they are blasting out new shortened URLs.
from collections import defaultdict

def __init__(self):
    ...
    self.tweet_count = defaultdict(int)

def track_tweet(self, tweet):
    t = self.clean_tweet(tweet)
    self.tweet_count[t] += 1

def clean_tweet(self, tweet):
    t = tweet.lower()
    # any other tweet normalization happens here, such as dropping URLs
    return t

def DownloadData(self):
    ...
    for tweet in self.tweets:
        ...
        # add logic to check for number of repeats in the dictionary.
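To make that last comment a little more concrete, here is one hedged way the duplicate check might look inside the existing loop, reusing the track_tweet / clean_tweet helpers sketched above:

for tweet in self.tweets:
    cleaned = self.clean_tweet(tweet.text)
    self.track_tweet(tweet.text)
    if self.tweet_count[cleaned] > 1:
        # this exact tweet has already been analyzed once; skip the repeat
        continue
    analysis = TextBlob(tweet.text)
    # ... rest of the existing per-tweet processing ...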


Anonymisation of xMillion entries - need performance hints

Edit: it was a huge mistake to initialize name-dataset on each call (it loads the blacklist name dataset into memory).
I just moved m = NameDataset() to the main block. I did not measure it, but it is now at least 100 times faster.
I developed a fast, searchable, multilingual DB with DataTables. I want to anonymize some data (names etc.) from the MySQL DB.
At first I used spaCy, but since the dictionaries were not trained, the results had too many false positives. Right now I am pretty happy with name-dataset. It works completely differently, of course, and it is much slower than spaCy, which benefits from GPU power.
Since the custom names DB got pretty big (350,000 lines) and the target DB is huge, processing each found word with regex.finditer in a loop takes forever (Ryzen 7 3700X).
It runs at roughly 5 rows/sec, which means more than 100 hours for several million rows.
Since each process uses only about 10% of the CPU, I start several (up to 10) Python processes - in the end it still takes too long.
I hope I never have to do this again, but I am afraid that I will.
That is why I am asking: do you have any performance tips for the following routines?
The outer for loop in main(), which loops through the piped object (the DB rows) and calls anonymize() three times (= 3 items/columns).
anonymize() itself, which also has a for loop that runs through each found word.
Would it make sense to rewrite them to use CUDA/Numba (an RTX 2070 is available)?
Any other performance tips? Thanks!
import simplejson as json
import sys, regex, logging, os
from names_dataset import NameDataset


def anonymize(sourceString, col):
    replacement = 'xxx'
    output = ''
    words = sourceString.split(' ')
    # and this second loop for each word (will run three times per row)
    for word in words:
        newword = word
        # regex for findind/splitting the words
        fRegExStr = r'(?=[^\s\r\n|\(|\)])(\w+)(?=[\.\?:,!\-/\s\(\)]|$)'
        pattern = regex.compile(fRegExStr)
        regx = pattern.finditer(word)
        if regx is None:
            if m.search_first_name(word, use_upper_Row=True):
                output += replacement
            elif m.search_last_name(word, use_upper_Row=True):
                output += replacement
            else:
                output += word
        else:
            for eachword in regx:
                if m.search_first_name(eachword.group(), use_upper_Row=True):
                    newword = newword.replace(eachword.group(), replacement)
                elif m.search_last_name(eachword.group(), use_upper_Row=True):
                    newword = newword.replace(eachword.group(), replacement)
            output += newword
        output += ' '
    return output


def main():
    # object with data is piped to the python script, data structure:
    # MyRows: {
    #   [Text_A: 'some text', Text_B: 'some more text', Text_C: 'still text'],
    #   [Text_A: 'some text', Text_B: 'some more text', Text_C: 'still text'],
    #   ....several thousand rows
    # }
    MyRows = json.load(sys.stdin, 'utf-8')

    # this is the first outer loop for each row
    for Row in MyRows:
        xText_A = Row['Text_A']
        if Row['Text_A'] and len(Row['Text_A']) > 30:
            Row['Text_A'] = anonymize(xText_A, 'Text_A')
        xText_B = Row['Text_B']
        if xText_B and len(xText_B) > 10:
            Row['Text_B'] = anonymize(xText_B, 'Text_B')
        xMyRowText_C = Row['MyRowText_C']
        if xMyRowText_C and len(xMyRowText_C) > 10:
            Row['MyRowText_C'] = anonymize(xMyRowText_C, 'MyRowText_C')

    retVal = json.dumps(MyRows, 'utf-8')
    return retVal


if __name__ == '__main__':
    m = NameDataset()  ## Right here is good - THIS WAS THE BOTTLENECK ##
    retVal = main()
    sys.stdout.write(str(retVal))
You are doing
for word in words:
    newword = word
    # regex for findind/splitting the words
    fRegExStr = r'(?=[^\s\r\n|\(|\)])(\w+)(?=[\.\?:,!\-/\s\(\)]|$)'
    pattern = regex.compile(fRegExStr)
    regx = pattern.finditer(word)
meaning that you regex.compile exactly the same pattern on every iteration of the loop, while you could do it once before the loop begins and get the same result.
I do not see other obvious optimizations, so I suggest profiling the code to find out which part is the most time-consuming.
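For illustration, here is a minimal sketch of that change, reusing the names from the question's code (m is the module-level NameDataset). The pattern is compiled once at module level, and the if regx is None branch is dropped because finditer always returns an iterator, never None:

import regex

# compiled once at import time instead of once per word
WORD_PATTERN = regex.compile(r'(?=[^\s\r\n|\(|\)])(\w+)(?=[\.\?:,!\-/\s\(\)]|$)')

def anonymize(sourceString, col):
    replacement = 'xxx'
    output = ''
    for word in sourceString.split(' '):
        newword = word
        for eachword in WORD_PATTERN.finditer(word):
            if m.search_first_name(eachword.group(), use_upper_Row=True) or \
               m.search_last_name(eachword.group(), use_upper_Row=True):
                newword = newword.replace(eachword.group(), replacement)
        output += newword + ' '
    return output

Whether this alone is enough for millions of rows is something only profiling will tell, as suggested above.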

How to extract sub strings from huge string in python?

I have huge strings (7-10k characters) from log files that I need to automatically extract and tabulate information from. Each string contains approximately 40 values that are input by various people. Example:
Example string 1.) 'Color=Blue, [randomJunkdataExampleHere] Weight=345Kg, Age=34 Years, error#1 randomJunkdataExampleThere error#1'
Example string 2.) '[randomJunkdataExampleHere] Color=Red 42, Weight=256 Lbs., Age=34yers, error#1, error#2'
Example string 3.) 'Color=Yellow 13,Weight=345lbs., Age=56 [randomJunkdataExampleHere]'
The desired outcome is a new string, or even a dictionary, that organizes the data and readies it for database entry (one string for each row of data):
Color,Weight,Age,Error#1Count,Error#2Count
blue,345,34,2,0
red,256,24,1,1
yellow,345,56,0,0
I considered using re.search for each column/value, but since there's variance in how users input data, I don't know how to capture just the numbers I want to extract. I also have no idea how to count the number of times 'error#1' occurs in the string.
import re

line = '[randomJunkdataExampleHere] Color=Blue, Weight=345Kg, Age=34 Years, error#1, randomJunkdataExampleThere error#1'
try:
    Weight = re.search('Weight=(.+?), Age', line).group(1)
except AttributeError:
    Weight = 'ERROR'
As stated above, 10000 characters really isn't a huge deal.
import time

example_string_1 = 'Color=Blue, Weight=345Kg, Age=34 Years, error#1, error#1'
example_string_2 = 'Color=Red 42, Weight=256 Lbs., Age=34 yers, error#1, error#2'
example_string_3 = 'Color=Yellow 13, Weight=345lbs., Age=56'

def run():
    examples = [example_string_1, example_string_2, example_string_3]
    dict_list = []
    for example in examples:
        # first, I would suggest tokenizing the string to identify individual data entries.
        tokens = example.split(', ')
        my_dict = {}
        for token in tokens:
            if '=' in token:  # Non-error case
                subtokens = token.split('=')  # this will split the token into two parts, i.e ['Color', 'Blue']
                my_dict[subtokens[0]] = subtokens[1]
            elif '#' in token:  # error case. Not sure if this is actually the format. If not, you'll have to find something to key off of.
                if 'error' not in my_dict or my_dict['error'] is None:
                    my_dict['error'] = [token]
                else:
                    my_dict['error'].append(token)
        dict_list.append(my_dict)

# Now lets test out how fast it is.
before = time.time()
for i in range(100000):  # run it a hundred thousand times
    run()
after = time.time()

print("Time: {0}".format(after - before))
Yields:
Time: 0.5782015323638916
See? Not too bad. Now all that is left is to iterate over the dictionary and record the metrics you want.
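To finish the job, here is a hedged sketch of that last step. It assumes run() is adjusted to return dict_list, and that Error#1Count / Error#2Count simply count how often 'error#1' and 'error#2' appear in each entry:

import re

def to_csv_rows(dict_list):
    # hypothetical follow-up: turn each parsed dict into the desired CSV row,
    # keeping only the leading digits of values like '345Kg' or '34 Years'
    rows = ['Color,Weight,Age,Error#1Count,Error#2Count']
    for d in dict_list:
        errors = d.get('error', [])
        color = d.get('Color', '').split()[0].lower() if d.get('Color') else ''
        weight = re.search(r'\d+', d.get('Weight', ''))
        age = re.search(r'\d+', d.get('Age', ''))
        rows.append(','.join([
            color,
            weight.group() if weight else '',
            age.group() if age else '',
            str(errors.count('error#1')),
            str(errors.count('error#2')),
        ]))
    return rows

Feeding it the dicts parsed from the three example strings gives 'blue,345,34,2,0' and so on, which can then be written out with csv.writer or inserted straight into a database.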

How to avoid for loops and iterate through pandas dataframe properly?

I have this code that I've been struggling to optimize for a while.
My dataframe comes from a CSV file with two columns, the second of which contains the review texts.
I have a function summarize(text, n) that takes a single text and an integer as input:
def summarize(text, n):
    sents = sent_tokenize(text)  # text into tokenized sentences
    # Checking if there are less sentences in the given review than the required length of the summary
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word tokenized sentences
    frequency = calculate_freq(list_sentences)  # calculating the word frequency for all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Calling the rank function to get the highest ranking
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
To summarize() all the texts, I first iterate through my dataframe and create a list of all the texts, which I then iterate over again to send them one by one to the summarize() function and get the summary of each text. These for loops are making my code really slow, but I haven't been able to figure out a way to make it more efficient, and I would greatly appreciate any suggestions.
data = pd.read_csv('dataframe.csv')
text = data.iloc[:, 2]  # ilocating the texts

list_of_strings = []
for t in text:
    list_of_strings.append(t)  # creating a list of all the texts

our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)

ours = pd.DataFrame({"our_summary": our_summary})
EDIT:
The other two functions are:
def calculate_freq(list_sentences):
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1
    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    if frequency.values():
        max_word = float(max(frequency.values()))
    else:
        max_word = 1
    for w in frequency.keys():
        frequency[w] = frequency[w] / max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency

def rank(ranking, n):
    # return n first sentences with highest ranking
    return nlargest(n, ranking, key=ranking.get)
Input text: Recipes are easy and the dogs love them. I would buy this book again and again. Only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you could make them all different sizes. Great buy!
Output text: I would buy this book again and again.
Have you tried something like this?
# Test data
df = pd.DataFrame({'ASIN': [0, 1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
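Assuming your real summarize(text, n) keeps its two-argument signature and returns a list of sentences, the same idea with apply and a lambda would give the one-sentence summaries in a single pass (column index 2 is taken from your own code):

# apply the asker's summarize() to every review and keep the single returned sentence
data['our_summary'] = data.iloc[:, 2].apply(lambda t: summarize(t, 1)[0])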
Such a long story...
I'm going to assume that, since you are performing a text frequency analysis, the order of reviewText doesn't matter. If that is the case:
Mega_String = ' '.join(data['reviewText'])
This concatenates all the strings in the reviewText column into one big string, with each review separated by a whitespace.
You can then just throw this result at your functions.

sentiment analysis of Non-English tweets in python

Objective: To classify each tweet as positive or negative and write it to an output file which will contain the username, original tweet and the sentiment of the tweet.
Code:
import re, math

input_file = "raw_data.csv"
fileout = open("Output.txt", "w")
wordFile = open("words.txt", "w")
expression = r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
fileAFINN = 'AFINN-111.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [ws.strip().split('\t') for ws in open(fileAFINN)]))
pattern = re.compile(r'\w+')
pattern_split = re.compile(r"\W+")
words = pattern_split.split(input_file.lower())

print "File processing started"

with open(input_file, 'r') as myfile:
    for line in myfile:
        line = line.lower()
        line = re.sub(expression, " ", line)
        words = pattern_split.split(line.lower())
        sentiments = map(lambda word: afinn.get(word, 0), words)
        # print sentiments
        # How should you weight the individual word sentiments?
        # You could do N, sqrt(N) or 1 for example. Here I use sqrt(N)
        """
        Returns a float for sentiment strength based on the input text.
        Positive values are positive valence, negative value are negative valence.
        """
        if sentiments:
            sentiment = float(sum(sentiments)) / math.sqrt(len(sentiments))
            # wordFile.write(sentiments)
        else:
            sentiment = 0
        wordFile.write(line + ',' + str(sentiment) + '\n')
        fileout.write(line + '\n')

print "File processing completed"
fileout.close()
myfile.close()
wordFile.close()
Issue: Currently the output.txt file looks like this:
abc some tweet text 0
bcd some more tweets 1
efg some more tweet 0
Question 1: How do I add a comma between the user id, tweet text and sentiment? The output should look like:
abc,some tweet text,0
bcd,some other tweet,1
efg,more tweets,0
Question 2: The tweets are in Bahasa Melayu (BM) and the AFINN dictionary that I am using is of English words. So the classification is wrong. Do you know any BM dictionary that I can use?
Question 3: How do I pack this code in a JAR file?
Thank you.
Question 1:
output.txt is currently composed of exactly the lines you are reading in, because of fileout.write(line+'\n'). Since each line is space separated, you can split it pretty easily:
line_data = line.split(' ') # Split the line into a list, separated by spaces
user_id = line_data[0] # The first element of the list
tweets = line_data[1:-1] # The middle elements of the list
sentiment = line_data[-1] # The last element of the list
fileout.write(user_id + "," + " ".join(tweets) + "," + sentiment +'\n')
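To see that at work on the first sample line above (purely illustrative):
>>> line = 'abc some tweet text 0'
>>> line_data = line.split(' ')
>>> line_data[0] + "," + " ".join(line_data[1:-1]) + "," + line_data[-1]
'abc,some tweet text,0'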
Question 2:
A quick google search gave me this. Not sure if it has everything you will need though: https://archive.org/stream/grammardictionar02craw/grammardictionar02craw_djvu.txt
Question 3:
Try Jython http://www.jython.org/archive/21/docs/jythonc.html

I'm using the following code to obtain tweets from Twitter. How can I get the tweet texts alone from it?

In the variable x here I have all the contents of the page, but I wish to obtain the tweet texts alone. How can I do this?
from twitter import *

t = Twitter(auth=OAuth("1865941472-AbdUiX4843PBSkz0LwiLXlbbIPj20w9UQKYg5lY",
                       "WJ76T7i0PDotsP8C42F74hbhzbtUT5cxV3z9ZbcZCuw",
                       "lNhLOub6HsRm0sukRuyVA",
                       "QfwvN94uXX55rJ6b5tOCDwCUTfsHXnfxzxRf1Fgt1k"))

t.statuses.home_timeline()
x = t.search.tweets(q="#pycon")
t.statuses.home_timeline()
print x
If you want to get tweets from your timeline:
home = t.statuses.home_timeline()
for i in range(len(home)):
    print home[i]['user']['name'] + ' (#' + home[i]['user']['screen_name'] + '): ' + home[i]['text']
You'll get output in this pretty style:
username (#userlogin): text of tweet
By default, home_timeline() returns the 20 most recent tweets. You may change the number of tweets returned using home_timeline(count=cnt). Another way is to use TwitterStream() instead of Twitter().
If you want to get tweets from search:
query = t.search.tweets(q="#pycon")
for i in range(len(query['statuses'])):
    print query['statuses'][i]['user']['name'] + ' (#' + query['statuses'][i]['user']['screen_name'] + ') wrote:', query['statuses'][i]['text']
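If all you need are the raw tweet texts (as asked), a list comprehension over the same search result would do; this is just a sketch in the same Python 2 style as the code above:

# collect only the text of each tweet returned by the search
tweet_texts = [status['text'] for status in query['statuses']]
for text in tweet_texts:
    print text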
