I need some advice on my Twitter sentiment analysis.
I'm doing a fairly standard sentiment analysis, but on the tweets of selected users rather than on random tweets from Twitter search.
What I've tried so far: I read in a CSV of the users, iterate over that list, and run the tweet analysis user by user.
I'll put my write_tweets function here so it can get some feedback :)
def write_tweets(users_df, file):
    # If the file exists, read the existing data from the CSV file.
    if os.path.exists(file):
        df = pd.read_csv(file, header=0)
    else:
        df = pd.DataFrame(columns=COLS)
    # page attribute in tweepy.Cursor and iteration
    for user in users_df[0]:
        print(user)
        try:
            for status in tweepy.Cursor(api.user_timeline, screen_name=user, count=1, tweet_mode="extended").items(1):
                new_entry = []
                status = status._json
                # skip tweets older than the start date
                if to_datetime(status['created_at']) < startDate:
                    continue
                # check whether the tweet is in English, otherwise skip to the next tweet
                if status['lang'] != 'en':
                    continue
                # tweepy preprocessing called for basic preprocessing
                clean_text = clean(status['full_text'])
                # call the clean_tweets method for extra preprocessing
                filtered_tweet = clean_tweets(clean_text)
                # pass to TextBlob for the sentiment calculations
                blob = TextBlob(filtered_tweet)
                blob_2 = TextBlob(filtered_tweet, analyzer=NaiveBayesAnalyzer())
                Sentiment = blob.sentiment
                Sentiment_2 = blob_2.sentiment
                # separate polarity and subjectivity into two variables
                polarity = Sentiment.polarity
                subjectivity = Sentiment.subjectivity
                positivity = Sentiment_2.p_pos
                negativity = Sentiment_2.p_neg
                # append the new entry
                new_entry += [status['id'], status['created_at'],
                              status['source'],
                              filtered_tweet, Sentiment, polarity, subjectivity, positivity, negativity, status['lang'],
                              status['favorite_count'], status['retweet_count']]
                # append the original author of the tweet
                new_entry.append(status['user']['screen_name'])
                try:
                    is_sensitive = status['possibly_sensitive']
                except KeyError:
                    is_sensitive = None
                new_entry.append(is_sensitive)
                # hashtags and mentions are saved comma separated
                hashtags = ", ".join([hashtag_item['text'] for hashtag_item in status['entities']['hashtags']])
                new_entry.append(hashtags)
                mentions = ", ".join([mention['screen_name'] for mention in status['entities']['user_mentions']])
                new_entry.append(mentions)
                # get the location of the tweet if possible
                try:
                    location = status['user']['location']
                except TypeError:
                    location = ''
                new_entry.append(location)
                try:
                    coordinates = [coord for loc in status['place']['bounding_box']['coordinates'] for coord in loc]
                except TypeError:
                    coordinates = None
                new_entry.append(coordinates)
                single_tweet_df = pd.DataFrame([new_entry], columns=COLS)
                df = df.append(single_tweet_df, ignore_index=True)
                csvFile = open(file, 'a', encoding='utf-8')
        except Exception as e:
            pass
    df.to_csv(csvFile, mode='a', columns=COLS, index=False, encoding="utf-8")

write_tweets(users_list, test_file)
The output is a few indicators of sentiment, such as positivity, negativity, neutrality, etc.
My question is: has anyone done this kind of thing already and can give me some recommendations? My version of it seems very slow and not very efficient (to me, at least).
Thanks in advance.
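For reference, two things tend to dominate the run time in code like this: the NaiveBayesAnalyzer is typically far slower than TextBlob's default PatternAnalyzer, so it is only worth it if you actually need the p_pos/p_neg scores, and appending to a DataFrame plus reopening the CSV inside the loop is expensive compared with collecting plain lists and writing once. A minimal sketch of that second idea, not the original function: the column names are only illustrative, and the helpers clean, clean_tweets, to_datetime and startDate are assumed to exist as in the code above.

# Rough sketch: collect plain rows in a list and write the CSV once at the end.
def write_tweets_once(users_df, file):
    rows = []
    for user in users_df[0]:
        try:
            for status in tweepy.Cursor(api.user_timeline, screen_name=user,
                                        tweet_mode="extended").items(1):
                status = status._json
                if to_datetime(status['created_at']) < startDate or status['lang'] != 'en':
                    continue
                filtered_tweet = clean_tweets(clean(status['full_text']))
                sentiment = TextBlob(filtered_tweet).sentiment
                rows.append([status['id'], status['created_at'], filtered_tweet,
                             sentiment.polarity, sentiment.subjectivity,
                             status['user']['screen_name']])
        except Exception:
            continue
    cols = ['id', 'created_at', 'text', 'polarity', 'subjectivity', 'screen_name']
    pd.DataFrame(rows, columns=cols).to_csv(file, index=False, encoding="utf-8")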
I scraped Twitter data and tried to translate the whole dataset (around 30,000 rows) with GoogleTranslator from the deep-translator package in Python. To get around the API limit, I subdivided the dataset into smaller batches. But there is still a problem, probably with the NaN values in user_description? Could someone help out?
import pickle
import time

from deep_translator import GoogleTranslator

batch_1 = no_eng_data.iloc[0:1000]
batch_2 = no_eng_data.iloc[1000:2000]

def batch_translation(batch):
    data_list = []
    for index, row in batch.iterrows():
        # NaN != NaN, so this replaces missing descriptions with an empty string
        if row["user_description"] != row["user_description"]:
            text = ""
        else:
            text = row["user_description"]
        translated = GoogleTranslator(source='auto', target='en').translate(text=text)
        data_list.append(translated)
    return data_list

data_list_1 = batch_translation(batch_1)
with open('data_list_1.pkl', 'wb') as f:
    pickle.dump(data_list_1, f)
time.sleep(300)
Error:
NotValidPayload: --> text must be a valid text with maximum 5000 character, otherwise it cannot be translated
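Note that deep-translator rejects empty strings as well as text over 5,000 characters, so the "" substituted for NaN descriptions can itself trigger NotValidPayload. A minimal sketch of one way to guard both cases; skipping blanks and truncating long descriptions are assumptions about what you want to keep:

from deep_translator import GoogleTranslator

MAX_LEN = 5000  # per-request limit stated in the error message

def safe_translate(text):
    # Skip NaN / non-string / blank descriptions instead of sending them.
    if not isinstance(text, str) or not text.strip():
        return ""
    # Truncate overly long descriptions rather than failing the whole batch.
    return GoogleTranslator(source='auto', target='en').translate(text=text[:MAX_LEN])

def batch_translation(batch):
    return [safe_translate(row["user_description"]) for _, row in batch.iterrows()]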
I'm a relative novice at Python, but somehow I managed to build a scraper for Instagram. I now want to take this one step further and output the 5 most commonly used hashtags from an IG profile into my CSV output file.
Current output:
I've managed to isolate the 5 most commonly used hashtags, but I get this result in my csv:
[('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2), ('#togetherwecanchangetheworld', 1), ('#halloweenchronicles', 1)]
Desired output:
What I'm looking to end up with is five columns at the end of my .csv, each containing the X-th most commonly used value.
So something along the lines of this:
I've Googled for a while and managed to isolate them separately, but I always end up with '('#thekidfromakron', 2)' as an output. I seem to be missing some part of the puzzle :(.
Here is what I'm working with at the moment:
import csv
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from collections import Counter

ts = time.gmtime()

def get_csv_header(top_numb):
    fieldnames = ['USER', 'MEDIA COUNT', 'FOLLOWERCOUNT', 'TOTAL LIKES', 'TOTAL COMMENTS',
                  'ER', 'ER IN %', 'BIO', 'ALL CAPTION TEXT', 'HASHTAGS COUNTED', 'MOST COMMON HASHTAGS']
    return fieldnames

def write_csv_header(filename, headers):
    with open(filename, 'w', newline='') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=headers)
        writer.writeheader()
    return

def read_user_name(t_file):
    with open(t_file) as f:
        user_list = f.read().splitlines()
    return user_list

if __name__ == '__main__':
    # HERE YOU CAN SPECIFY YOUR USERLIST FILE NAME,
    # which contains a list of usernames; BY DEFAULT <current working directory>/userlist.txt
    USER_FILE = 'userlist.txt'
    # HERE YOU CAN SPECIFY YOUR DATA FILE NAME, BY DEFAULT (data.csv), where your final result stays
    DATA_FILE = 'users_with_er.csv'
    MAX_POST = 12  # MAX POST

    print('Starting the engagement calculations... Please wait until it finishes!')
    users = read_user_name(USER_FILE)

    """ Writing data to csv file """
    csv_headers = get_csv_header(MAX_POST)
    write_csv_header(DATA_FILE, csv_headers)

    for user in users:
        post_info = {'USER': user}
        url = 'https://www.instagram.com/' + user + '/'
        # for troubleshooting, un-comment the next two lines:
        # print(user)
        # print(url)
        try:
            r = requests.get(url)
            if r.status_code != 200:
                print(timestamp, ' user {0} not found or page unavailable! Skipping...'.format(user))
                continue
            soup = BeautifulSoup(r.content, "html.parser")
            scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
            stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
            j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
            timestamp = time.strftime("%d-%m-%Y %H:%M:%S", ts)
        except ValueError:
            print(timestamp, 'ValueError for username {0}...Skipping...'.format(user))
            continue
        except IndexError as error:
            # Output expected IndexErrors.
            print(timestamp, error)
            continue
        if j['graphql']['user']['edge_followed_by']['count'] <= 0:
            print(timestamp, 'user {0} has no followers! Skipping...'.format(user))
            continue
        if j['graphql']['user']['edge_owner_to_timeline_media']['count'] < 12:
            print(timestamp, 'user {0} has less than 12 posts! Skipping...'.format(user))
            continue
        if j['graphql']['user']['is_private'] is True:
            print(timestamp, 'user {0} has a private profile! Skipping...'.format(user))
            continue
        media_count = j['graphql']['user']['edge_owner_to_timeline_media']['count']
        accountname = j['graphql']['user']['username']
        followercount = j['graphql']['user']['edge_followed_by']['count']
        bio = j['graphql']['user']['biography']

        i = 0
        total_likes = 0
        total_comments = 0
        all_captiontext = ''
        while i <= 11:
            total_likes += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_liked_by']['count']
            total_comments += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_comment']['count']
            captions = j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_caption']
            caption_detail = captions['edges'][0]['node']['text']
            all_captiontext += caption_detail
            i += 1

        engagement_rate_percentage = '{0:.4f}'.format((((total_likes + total_comments) / followercount) / 12) * 100) + '%'
        engagement_rate = (((total_likes + total_comments) / followercount) / 12 * 100)

        # isolate and count hashtags
        hashtags = re.findall(r'#\w*', all_captiontext)
        hashtags_counted = Counter(hashtags)
        most_common = hashtags_counted.most_common(5)

        with open('users_with_er.csv', 'a', newline='', encoding='utf-8') as data_out:
            print(timestamp, 'Writing Data for user {0}...'.format(user))
            post_info["USER"] = accountname
            post_info["FOLLOWERCOUNT"] = followercount
            post_info["MEDIA COUNT"] = media_count
            post_info["TOTAL LIKES"] = total_likes
            post_info["TOTAL COMMENTS"] = total_comments
            post_info["ER"] = engagement_rate
            post_info["ER IN %"] = engagement_rate_percentage
            post_info["BIO"] = bio
            post_info["ALL CAPTION TEXT"] = all_captiontext
            post_info["HASHTAGS COUNTED"] = hashtags_counted
            csv_writer = csv.DictWriter(data_out, fieldnames=csv_headers)
            csv_writer.writerow(post_info)

    """ Done with the script """
    print('ALL DONE !!!! ')
The code before this simply scrapes the webpage and compiles all the captions from the last 12 posts into "all_captiontext".
Any help solving this (probably simple) issue would be greatly appreciated, as I've been struggling with it for days (again, I'm a noob :') ).
Replace line
post_info["MOST COMMON HASHTAGS"] = most_common
with:
for i, counter_tuple in enumerate(most_common):
    tag_name = counter_tuple[0].replace('#', '')
    label = "Top %d" % (i + 1)
    post_info[label] = tag_name
There's also a bit of code missing. For example, your code doesn't include the csv_headers variable, which I suppose would be
csv_headers = post_info.keys()
It also seems that you're opening a file to write just one row. I don't think that's intended, so what you probably want to do is collect the results into a list of dictionaries. A cleaner solution would be to use a pandas DataFrame, which you can write straight to a CSV file, for example as sketched below.
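(Rough sketch only; the user list and field names are placeholders, not the full header list from the script above.)

import pandas as pd

# Collect one dict per user instead of writing each row to the CSV separately.
records = []
for user in ['user_a', 'user_b']:                   # placeholder user list
    post_info = {'USER': user, 'FOLLOWERCOUNT': 0}  # filled in by the scraper in practice
    records.append(post_info)

# Build the table once and write it in a single call.
pd.DataFrame(records).to_csv('users_with_er.csv', index=False, encoding='utf-8')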
most_common is the output of the call to hashtags_counted.most_common; I had a look at the docs here: https://docs.python.org/2/library/collections.html#collections.Counter.most_common
The output is formatted as [(key, value), (key, value), ...] and ordered by decreasing number of occurrences.
Hence, to get only the name and not the number of occurrences, you should replace:
post_info["MOST COMMON HASHTAGS"] = most_common
by
post_info["MOST COMMON HASHTAGS"] = [x[0] for x in most_common]
You have a list of tuples. This statement builds, on the fly, the list of first elements of each tuple, keeping the sort order.
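For illustration, with made-up hashtag counts:

from collections import Counter

hashtags_counted = Counter(['#jamesgang', '#jamesgang', '#striveforgreatness'])
most_common = hashtags_counted.most_common(2)
print(most_common)                  # [('#jamesgang', 2), ('#striveforgreatness', 1)]
print([x[0] for x in most_common])  # ['#jamesgang', '#striveforgreatness']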
I'm trying to convert a text file to an Excel sheet in Python. The txt file contains data in the format specified below.
Column names: reg no, zip code, loc id, emp id, lastname, first name. Each record has one or more error numbers, and each record has its column names listed above its values. I would like to create an Excel sheet containing reg no, firstname, lastname and the errors listed in separate rows for each record.
How can I put the records in an Excel sheet? Should I be using regular expressions? And how can I insert the error numbers in different rows for the corresponding record?
Expected output:
Here is the link to the input file:
https://github.com/trEaSRE124/Text_Excel_python/blob/master/new.txt
Any code snippets or suggestions are kindly appreciated.
Here is a draft of the code. Let me know if any changes are needed:
# import pandas as pd
from collections import OrderedDict
from datetime import date
import csv

with open('in.txt') as f:
    with open('out.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        # Remove initial clutter
        while "INPUT DATA" not in f.readline():
            continue
        header = ["REG NO", "ZIP CODE", "LOC ID", "EMP ID", "LASTNAME", "FIRSTNAME", "ERROR"]
        data = list()
        errors = list()
        spamwriter.writerow(header)
        print header
        while True:
            line = f.readline()
            errors = list()
            if "END" in line:
                exit()
            try:
                int(line.split()[0])
                data = line.strip().split()
                f.readline()  # get rid of \n
                line = f.readline()
                while "ERROR" in line:
                    errors.append(line.strip())
                    line = f.readline()
                spamwriter.writerow(data + errors)
                csvfile.flush()  # csv.writer has no flush(); flush the underlying file instead
            except:
                continue
Run it with Python 2. The errors are appended as subsequent columns. Getting them into separate rows the way you want is slightly more complicated; I can fix it if it's still needed.
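If you would rather run it under Python 3, a minimal (untested) adaptation of the file handling would look roughly like this, with the same new.txt layout assumed:

import csv

with open('in.txt') as f, open('out.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    # Skip the initial clutter, as in the Python 2 version above.
    while "INPUT DATA" not in f.readline():
        continue
    header = ["REG NO", "ZIP CODE", "LOC ID", "EMP ID", "LASTNAME", "FIRSTNAME", "ERROR"]
    spamwriter.writerow(header)
    print(header)  # print is a function in Python 3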
Output looks like:
You can do this using the openpyxl library which is capable of depositing items directly into a spreadsheet. This code shows how to do that for your particular situation.
NEW_PERSON, ERROR_LINE = 1, 2

def Line_items():
    with open('katherine.txt') as katherine:
        for line in katherine:
            line = line.strip()
            if not line:
                continue
            items = line.split()
            if items[0].isnumeric():
                yield NEW_PERSON, items
            elif items[:2] == ['ERROR', 'NUM']:
                yield ERROR_LINE, line
            else:
                continue

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

ws['A2'] = 'REG NO'
ws['B2'] = 'LASTNAME'
ws['C2'] = 'FIRSTNAME'
ws['D2'] = 'ERROR'

row = 2
for kind, data in Line_items():
    if kind == NEW_PERSON:
        row += 2
        ws['A{:d}'.format(row)] = int(data[0])
        ws['B{:d}'.format(row)] = data[-2]
        ws['C{:d}'.format(row)] = data[-1]
        first = True
    else:
        if first:
            first = False
        else:
            row += 1
        ws['D{:d}'.format(row)] = data

wb.save(filename='katherine.xlsx')
This is a screen snapshot of the result.
Here is my situation: my code parses data out of HTML tables that are within emails. The roadblock I'm running into is that some of these tables have blank rows right in the middle, as seen in the photo below. This blank space causes my code to fail (IndexError: list index out of range) when it attempts to extract text from the cells.
Is it possible to say to Python: "OK, if you run into this error that comes from these blank rows, just stop there, take the rows you have acquired text from so far, and execute the rest of the code on those"...?
That might sound like a dumb solution to the problem, but my project only involves taking data from the most recent date in the table anyway, which is always among the first few rows, and always before these blank rows.
So if it is possible to say "if you hit this error, just ignore it and proceed", then I would like to learn how to do that. If it's not, then I'll have to figure out another way around it. Thanks for any and all help.
The table with the gap:
My code:
from bs4 import BeautifulSoup, NavigableString, Tag
import pandas as pd
import numpy as np
import os
import re
import email
import cx_Oracle

dsnStr = cx_Oracle.makedsn("sole.nefsc.noaa.gov", "1526", "sole")
con = cx_Oracle.connect(user="user", password="password", dsn=dsnStr)

def celltext(cell):
    '''
    textlist = []
    for br in cell.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = str(next).strip()
            if text:
                textlist.append(next)
    return (textlist)
    '''
    textlist = []
    y = cell.find('span')
    for a in y.childGenerator():
        if isinstance(a, NavigableString):
            textlist.append(str(a))
    return (textlist)

path = 'Z:\\blub_2'

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        html = open(file_path, 'r').read()
        soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
        table = soup.find_all('table')[1]   # Grab the second table

        df_Quota = pd.DataFrame()

        for row in table.find_all('tr'):
            columns = row.find_all('td')
            if columns[0].get_text().strip() != 'ID':  # skip header
                Quota = celltext(columns[1])
                Weight = celltext(columns[2])
                price = celltext(columns[3])
                print(Quota)
                Nrows = max([len(Quota), len(Weight), len(price)])  # get the max number of rows
                IDList = [columns[0].get_text()] * Nrows
                DateList = [columns[4].get_text()] * Nrows
                if price[0].strip() == 'Package':
                    price = [columns[3].get_text()] * Nrows
                if len(Quota) < len(Weight):  # if Quota has fewer items, extend with NaN
                    lstnans = [np.nan] * (len(Weight) - len(Quota))
                    Quota.extend(lstnans)
                if len(price) < len(Quota):   # if the price column has fewer items than Quota,
                    val = [columns[3].get_text()] * (len(Quota) - len(price))  # extend with whatever is in the price column
                    price.extend(val)
                #if len(DateList) > len(Quota):  # if DateList is longer than Quota,
                    #print("it's longer than")
                    #value = [columns[4].get_text()] * (len(DateList) - len(Quota))
                    #DateList = value * Nrows
                if len(Quota) < len(DateList):  # if Quota is shorter than DateList (due to gap),
                    stu = [np.nan] * (len(DateList) - len(Quota))  # extend with NaN
                    Quota.extend(stu)
                if len(Weight) < len(DateList):
                    dru = [np.nan] * (len(DateList) - len(Weight))
                    Weight.extend(dru)

                FinalDataframe = pd.DataFrame(
                    {
                        'ID': IDList,
                        'AvailableQuota': Quota,
                        'LiveWeightPounds': Weight,
                        'price': price,
                        'DatePosted': DateList
                    })

                df_Quota = df_Quota.append(FinalDataframe, ignore_index=True)

        #df_Quota = df_Quota.loc[df_Quota['DatePosted']=='5/20']
        df_Q = df_Quota['DatePosted'].iloc[0]
        df_Quota = df_Quota[df_Quota['DatePosted'] == df_Q]
        print(df_Quota)

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            pattern = re.compile(r'Sent:.*?\b(\d{4})\b')
            email = f.read()
            dates = pattern.findall(email)
            if dates:
                print("Date:", ''.join(dates))

#cursor = con.cursor()
#exported_data = [tuple(x) for x in df_Quota.values]
#sql_query = ("INSERT INTO ROUGHTABLE(species, date_posted, stock_id, pounds, money, sector_name, ask)" "VALUES (:1, :2, :3, :4, :5, 'NEFS 2', '1')")
#cursor.executemany(sql_query, exported_data)
#con.commit()
#cursor.close()
#con.close()
continue is the keyword to use for skipping empty/problem rows. The IndexError comes from attempting to access columns[0] on an empty columns list, so just skip to the next row when there is an exception.
for row in table.find_all('tr'):
    columns = row.find_all('td')
    try:
        if columns[0].get_text().strip() != 'ID':
            ...  # Rest as above in the original code.
    except IndexError:
        continue
Use try: ... except: ...:
try:
    ...  # extract data from the table
except IndexError:
    ...  # stop here and run the rest of the code on the rows collected so far
I have a shapefile (FireIncidents) that currently does not have any data. I want to populate the feature class with data from a text file (north_america_fires). The text file contains the lat, long, and confidence of each fire. I've created a new insert cursor to insert new rows, and I've started a for loop to loop through each of the rows in the text file. I am having trouble identifying a method to populate the point class object. I believe I have to create a list that the data from the text file can be appended to, and then somehow insert that list into the attribute table.
My current code:
try:
    work = raw_input("Enter the full path of WildlandFires.mdb: ")
    arcpy.env.workspace = work
    arcpy.env.overwriteOutput = True
    iFile = raw_input("Enter the full path of wildfire text file: ")
    fields = ["SHAPE#", "CONFIDENCEVALUE"]
    cur = arcpy.da.InsertCursor("FireIncidents", fields)
    f = open(iFile, 'r')
    lstFires = f.readlines()
    cntr = 0
    for fire in lstFires:
        if 'Latitude' in fire:
            continue
        row = fire.split(',')
        lstValues = []
        latitude = row[0].strip()
        longitude = row[1].strip()
        confid = row[2].strip()
        pnt = arcpy.CreateObject("Point")
        lstValues.append(pnt)
    f.close()
except Exception as e:
    print "Error: " + str(e)
    print arcpy.GetMessages()
    arcpy.AddError(e)
Any guidance would be appreciated.
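For what it's worth, here is a rough sketch of the missing insert step. It assumes the columns are ordered latitude, longitude, confidence as in your split above, and it uses the "SHAPE@XY" token (which accepts an (x, y) tuple) in place of "SHAPE#"; adjust the field names if your feature class differs.

cur = arcpy.da.InsertCursor("FireIncidents", ["SHAPE@XY", "CONFIDENCEVALUE"])
for fire in lstFires:
    if 'Latitude' in fire:      # skip the header line
        continue
    row = fire.split(',')
    latitude = float(row[0].strip())
    longitude = float(row[1].strip())
    confid = float(row[2].strip())
    # SHAPE@XY expects an (x, y) tuple, i.e. (longitude, latitude)
    cur.insertRow([(longitude, latitude), confid])
del cur                         # release the cursor's lock on the feature class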
I just identified the issue. It does have correct lat and long coordinates. Thank you for responding.