Most efficient way to Twitter Stream?


My partner and I started learning Python at the beginning of the year. We are almost finished with our code, but we are pulling our hair out trying to get it to work.
Assignment: Pull 250 tweets based on a certain topic, geocode the location of the tweets, analyze them for sentiment, then display them on a web map. We have accomplished almost all of that except the 250-tweet requirement.
I do not know how to pull the tweets more efficiently. The code works, but it writes only around seven to twelve rows of information to a CSV before it times out.
I tried setting a tracking parameter, but received this error: TypeError: 'NoneType' object is not subscriptable
I tried expanding the locations parameter to stream.filter(locations=[-180,-90,180,90]), but received a similar error: TypeError: 'NoneType' object has no attribute 'latitude'
I really do not know what I am missing and I was wondering if anyone has any ideas.
CODE BELOW:
from geopy import geocoders
from geopy.exc import GeocoderTimedOut
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from textblob import TextBlob
import json
import csv
def geo(location):
    g = geocoders.Nominatim(user_agent='USER')
    if location is not None:
        loc = g.geocode(location, timeout=None)
        if loc.latitude and loc.longitude is not None:
            return loc.latitude, loc.longitude
def WriteCSV(user, text, sentiment, lat, long):
    f = open('D:/PATHWAY/TO/tweets.csv', 'a', encoding="utf-8")
    write = csv.writer(f)
    write.writerow([user, text, sentiment, lat, long])
    f.close()
CK = ''
CS = ''
AK = ''
AS = ''
auth = tweepy.OAuthHandler(CK, CS)
auth.set_access_token(AK, AS)
#By setting these values to true, our code will automatically wait as it hits its limits
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
#Now I'm going to set up a stream listener
#https://stackoverflow.com/questions/20863486/tweepy-streaming-stop-collecting-tweets-at-x-amount
#https://wafawaheedas.gitbooks.io/twitter-sentiment-analysis-visualization-tutorial/sentiment-analysis-using-textblob.html
class StdOutListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super(StdOutListener, self).__init__()
        self.num_tweets = 0
    def on_data(self, data):
        Data = json.loads(data)
        Author = Data['user']['screen_name']
        Text = Data['text']
        Tweet = TextBlob(Data["text"])
        Sentiment = Tweet.sentiment.polarity
        x,y = geo(Data['place']['full_name'])
        if "coronavirus" in Text:
            WriteCSV(Author, Text, Sentiment, x,y)
            self.num_tweets += 1
        if self.num_tweets < 50:
            return True
        else:
            return False
stream = tweepy.Stream(auth=api.auth, listener=StdOutListener())
stream.filter(locations=[-122.441, 47.255, -122.329, 47.603])

The Twitter and geocoding APIs return all kinds of data, and some of the fields may be missing.
TypeError: 'NoneType' object has no attribute 'latitude'
This error comes from here:
loc = g.geocode(location, timeout=None)
if loc.latitude and loc.longitude is not None:
    return loc.latitude, loc.longitude
You pass in a location string and the geocoder searches for it. When it cannot find a match, geocode() returns None, so loc is None.
Consequently loc.latitude fails, because None has no such attribute.
You should check that loc is not None before accessing any of its attributes.
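The check described above could look roughly like the sketch below. It is not the asker's code: safe_geo and fake_geocode are hypothetical names, and fake_geocode is only a stand-in so the example runs without a network call. In the real script the geocode argument would be something like the geocode method of the Nominatim object from the question.

```python
from types import SimpleNamespace

def safe_geo(geocode, location):
    """Return (latitude, longitude), or (None, None) when nothing is found."""
    if not location:
        return None, None
    loc = geocode(location)
    if loc is None:  # the geocoder found no match
        return None, None
    return loc.latitude, loc.longitude

# Stand-in geocoder for illustration only (assumption, not a geopy API)
def fake_geocode(name):
    known = {"Seattle, WA": SimpleNamespace(latitude=47.6, longitude=-122.3)}
    return known.get(name)

print(safe_geo(fake_geocode, "Seattle, WA"))  # (47.6, -122.3)
print(safe_geo(fake_geocode, "Nowhere"))      # (None, None)
```

Always returning a two-item tuple also keeps the caller's x,y = ... unpacking from failing when no coordinates are available.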
x,y = geo(Data['place']['full_name'])
I know you are filtering tweets by location, so you would expect every Twitter status object to have Data['place']['full_name']. But this is not always the case. You should check that the key really exists, and that its value is not None, before accessing the values.
This applies generally and should be applied to your whole code. Write robust code. You will have an easier time debugging mistakes if you add some try/except blocks and print out the objects to see how they are built. You could also set a breakpoint in your except block and do some live inspection.
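As a minimal sketch of that kind of defensive access, assuming the tweet has already been parsed into a dict with json.loads: place_name is a hypothetical helper and the sample tweets are invented for illustration.

```python
def place_name(tweet):
    """Return tweet['place']['full_name'], or None if either level is missing."""
    place = tweet.get("place") or {}  # 'place' may be absent or null
    return place.get("full_name")

# Invented sample data covering the three cases
samples = [
    {"text": "a", "place": {"full_name": "Tacoma, WA"}},
    {"text": "b", "place": None},
    {"text": "c"},
]

for tweet in samples:
    name = place_name(tweet)
    if name is None:
        print("skipping tweet without a place:", tweet["text"])
    else:
        print(name)
```

Skipping such tweets instead of raising lets the stream keep running until the target count is reached.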

Related

TypeError: 'NoneType' object is not subscriptable for JSON parsing

I am trying to extract each event's fee information from this website, for pages 1 to 20, using Python. The event's fee comes from an external URL, so I need to load it as JSON and extract the fields from the result. I tried the code from this post and got the following error: TypeError: 'NoneType' object is not subscriptable. Based on my research, it means that the object is None and is therefore not subscriptable.
I tried to assign 'NA' when the value is None, but without success. I would appreciate it if anyone could kindly explain. Below is the code that I tried:
import re
import json
import requests
event_fees = []
for i in range(20):
    urls = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=' + str(i)
    events_url = 'https://www.eventbrite.com/api/v3/destination/events/?event_ids={event_ids}&expand=event_sales_status,primary_venue,image,saves,my_collections,ticket_availability&page_size=99999'
    html_text = requests.get(urls).text
    data1 = json.loads( re.search(r'window\.__SERVER_DATA__ = ({.*});', html_text).group(1) )
    event_ids = ','.join(r['id'] for r in data1['search_data']['events']['results'])
    data2 = requests.get(events_url.format(event_ids=event_ids)).json()
    for e in data2['events']:
        fees = (e['ticket_availability']['minimum_ticket_price']['display'],'to',e['ticket_availability']['maximum_ticket_price']['display'])
        if fees is None:
            event_fees.append("NA")
        else:
            event_fees.append(fees)

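Check that a value is present and is not None before subscripting it (obj['key']). In this case data2['events'][6]['ticket_availability']['minimum_ticket_price'] is None, so the ['display'] lookup fails before the "if fees is None" test is ever reached.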
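A sketch of one way to do that check, assuming each event is a dict shaped like the ones in the question. price_display is a hypothetical helper and the two sample events are invented for illustration, not real Eventbrite responses.

```python
def price_display(event, which):
    """Return the display string for a price field, or 'NA' if it is missing or None."""
    availability = event.get("ticket_availability") or {}
    price = availability.get(which) or {}
    return price.get("display", "NA")

# Invented sample events: one with prices, one where the API returned null
events = [
    {"ticket_availability": {
        "minimum_ticket_price": {"display": "10.00 MYR"},
        "maximum_ticket_price": {"display": "50.00 MYR"}}},
    {"ticket_availability": {
        "minimum_ticket_price": None,
        "maximum_ticket_price": None}},
]

event_fees = [
    (price_display(e, "minimum_ticket_price"), "to",
     price_display(e, "maximum_ticket_price"))
    for e in events
]
print(event_fees)
```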

How to get full text in Twitter search API?

I have used the Twitter search API with tweet_mode='extended' and the full_text attribute, but I am still getting a truncated string from the API.
Here is my code:
results = t.search(q='tuberculosis', count=50, lang='en', result_type='popular',tweet_mode='extended')
all_tweets = results['statuses']
for tweet in all_tweets:
    tweetString = tweet["full_text"]
    userMentionList = tweet["entities"]["user_mentions"]
    if len(userMentionList)>0:
        for eachUserMention in userMentionList:
            name = eachUserMention["screen_name"]
    time = tweet["created_at"]
    wks.insert_rows(wks.rows, values=[tweetString, name, time], inherit=True)

If you are using TwitterSearch, the following should work:
tso = TwitterSearchOrder()
tso.set_keywords('tuberculosis')
for tweet in ts.search_tweets_iterable(tso):
    print(tweet['text'])
You can of course set other attributes, such as language and count.
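One possible cause, offered as an assumption because the question does not show the truncated output: for retweets, the top-level full_text can still be cut short, while the complete text sits under retweeted_status. The hypothetical helper below prefers that nested text when it exists; the sample dicts are invented.

```python
def full_text(tweet):
    """Pick the longest available text field from a tweet dict."""
    retweeted = tweet.get("retweeted_status")
    if retweeted:
        # For retweets, the original tweet carries the complete text
        return retweeted.get("full_text") or retweeted.get("text")
    return tweet.get("full_text") or tweet.get("text")

# Invented sample data
plain = {"full_text": "a complete tweet"}
retweet = {"full_text": "RT @someone: a long twe...",
           "retweeted_status": {"full_text": "a long tweet, complete"}}

print(full_text(plain))
print(full_text(retweet))
```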

Writing data from Pub/Sub to Bigtable via Cloud Functions

I am a beginner with Cloud Bigtable and am having big issues using Cloud Functions to write data from Pub/Sub to Bigtable.
The Cloud Function receives the messages from Pub/Sub, but the issue is in the next step, writing them into Bigtable.
The message is created in a Python script and sent to Pub/Sub.
One example for a message:
b'{"eda":2.015176,"temperature":33.39,"bvp":-0.49,"x_acc":-36.0,"y_acc":-38.0,"z_acc":-128.0,"heart_rate":83.78,"iddevice":15.0,"timestamp":"2019-12-01T20:01:36.927Z"}'
For writing it into bigtable I created a table:
from google.cloud import bigtable
from google.cloud.bigtable import column_family
client = bigtable.Client(project="projectid", admin=True)
instance = client.instance("bigtableinstance")
table = instance.table("bigtable1")
print('Creating the {} table.'.format(table))
print('Creating columnfamily cf1 with Max Version GC rule...')
max_versions_rule = column_family.MaxVersionsGCRule(2)
column_family_id = 'cf1'
column_families = {column_family_id: max_versions_rule}
if not table.exists():
    table.create(column_families=column_families)
    print("Table {} is created.".format(table))
else:
    print("Table {} already exists.".format(table))
This works without problems.
Now I tried to write the message via pub/sub to bigtable with the following python code in cloud functions using the main method:
import json
import base64
import os
from google.cloud import bigtable
from google.cloud.bigtable import column_family, row_filters
project_id = os.environ.get('projetid', 'UNKNOWN')
INSTANCE = 'bigtableinstance'
TABLE = 'bigtable1'
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(INSTANCE)
colFamily = "cf1"
def writeToBigTable(table, data):
    # Parameters row_key (bytes) – The key for the row being created.
    # Returns A row owned by this table.
    row_key = data[colFamily]['iddevice'].value.encode()
    row = table.row(row_key)
    for colFamily in data.keys():
        for key in data[colFamily].keys():
            row.set_cell(colFamily,
                         key,
                         data[colFamily][key])
    table.mutate_rows([row])
    return data
def selectTable():
    stage = os.environ.get('stage', 'dev')
    table_id = TABLE + stage
    table = instance.table(table_id)
    return table
def main(event, context):
    data = base64.b64decode(event['data']).decode('utf-8')
    print("DATA: {}".format(data))
    eda, temperature, bvp, x_acc, y_acc, z_acc, heart_rate, iddevice, timestamp = data.split(',')
    table = selectTable()
    data = {'eda': eda,
            'temperature': temperature,
            'bvp': bvp,
            'x_acc':x_acc,
            'y_acc':y_acc,
            'z_acc':z_acc,
            'heart_rate':heart_rate,
            'iddevice':iddevice,
            'timestamp':timestamp}
    writeToBigTable(table, data)
    print("Data Written: {}".format(data))
I tried different versions but cannot find a solution.
Thanks for the help.
All the best
Dominik

I think this line is wrong:
row_key = data[colFamily]['iddevice'].value.encode()
You're passing in the data object, but it doesn't have a 'cf1' property. You also don't have to encode it. Give this a try:
row_key = data['iddevice']
Your for loop will also have the same issue. I think this is what you want instead:
for key in data.keys():
    row.set_cell(colFamily, key, data[key])
Also, I know you're just playing with it, but using a device id as the only value for a row key will end up poorly. A common recommendation is to combine the device id with the date or one of your other properties (depending on your queries) and use that as your row key. There is a document on Cloud Bigtable schema design that is helpful, and a codelab that uses a more realistic sample dataset and walks through how to pick a schema for that example. It's in Java, but you can still import the data and run your own queries.
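To make the row key suggestion concrete, here is a small sketch. make_row_key is a hypothetical helper, and the '#' separator and the device-then-timestamp order are assumptions; the right layout depends on the queries you plan to run.

```python
def make_row_key(device_id, timestamp):
    """Build a composite row key such as b'15#2019-12-01T20:01:36.927Z'."""
    # The sample message carries the device id as a float (15.0), so normalise it
    device = int(float(device_id))
    return "{}#{}".format(device, timestamp).encode("utf-8")

print(make_row_key(15.0, "2019-12-01T20:01:36.927Z"))
```

With the device id first, all readings for one device sort next to each other, ordered by time.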

First, thanks a lot for the help.
I tried to fix it with your code recommendation, but unfortunately it doesn't work now due to another error:
AttributeError: 'DirectRow' object has no attribute 'append'
I guess this comes from the following lines of code:
row.set_cell(colFamily,
             key,
             data[key])
I could imagine that the error originates in the split of the string "data":
eda, temperature, bvp, x_acc, y_acc, z_acc, heart_rate, iddevice, timestamp = data.split(',')
E.g. eda would look like this:
"'eda':2.015176"
which looks pretty wrong to me.
Especially when I insert it into the following dict:
data = {'eda': eda,....}
The error
AttributeError: 'DirectRow' object has no attribute 'append'
seems to say that there is a problem with the data I want to process with set_cell. The documentation mentions a list or any other iterable of DirectRow instances. Shouldn't a dict work for that?
I tried a workaround with a list, but this seems to make it even worse.
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(INSTANCE)
colFamily = "cf1"
def writeToBigTable(table, dat):
    row_key = "{}-{}".format(dat[16], dat[17])
    row = table.row(row_key)
    for n in range(len(dat)):
        row.set_cell(colFamily,
                     dat[n],
                     dat[n+9])
    table.mutate_rows([row])
    return dat
def selectTable():
    stage = os.environ.get('stage', 'dev')
    table_id = TABLE + stage
    table = instance.table(table_id)
    return table
def main(event, context):
    data = base64.b64decode(event['data']).decode('utf-8')
    print("DATA: {}".format(data))
    var_1, eda, var_2, temperature, var_3, bvp, var_4, x_acc, var_5, y_acc, var_6, z_acc, var_7, heart_rate, var_8, iddevice, var_9, timestamp = data.replace(':',',').split(',')
    table = selectTable()
    dat = [var_1, var_2, var_3, var_4, var_5, var_6, var_7, var_8, var_9, eda, temperature, bvp, x_acc, y_acc, z_acc, heart_rate, iddevice, timestamp]
    # data = {'eda': eda,
    # 'temperature': temperature,
    # 'bvp': bvp,
    # 'x_acc':x_acc,
    # 'y_acc':y_acc,
    # 'z_acc':z_acc,
    # 'heart_rate':heart_rate,
    # 'iddevice':iddevice,
    # 'timestamp':timestamp}
    writeToBigTable(table, dat)
    print("Data Written: {}".format(data))
I am really hard stuck at this problem and have no further ideas how to solve it.
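Since the example message in the question is a JSON object, one option is to parse it with json.loads rather than splitting on commas and colons (the timestamp itself contains colons, so splitting breaks it apart). The sketch below only covers the parsing step and does not touch Bigtable; parse_event and to_cells are hypothetical helpers, and storing every value as UTF-8 text is an assumption.

```python
import base64
import json

def parse_event(event):
    """Decode the Pub/Sub payload and parse it as JSON."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(payload)

def to_cells(record):
    """Map each field name to the bytes value that would go into a cell."""
    return {key: str(value).encode("utf-8") for key, value in record.items()}

# The sample message from the question, wrapped the way Pub/Sub delivers it
message = (b'{"eda":2.015176,"temperature":33.39,"bvp":-0.49,"x_acc":-36.0,'
           b'"y_acc":-38.0,"z_acc":-128.0,"heart_rate":83.78,"iddevice":15.0,'
           b'"timestamp":"2019-12-01T20:01:36.927Z"}')
event = {"data": base64.b64encode(message)}

record = parse_event(event)
cells = to_cells(record)
print(record["timestamp"])
print(cells["heart_rate"])
```

Each key and value in cells could then be passed to row.set_cell(colFamily, key, value) as in the code above.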

return actual tweets in tweepy?

I was writing a twitter program using tweepy. When I run this code, it prints the Python ... values for them, like
<tweepy.models.Status object at 0x95ff8cc>
Which is not good. How do I get the actual tweet?
import tweepy, tweepy.api
key = XXXXX
sec = XXXXX
tok  = XXXXX
tsec = XXXXX
auth = tweepy.OAuthHandler(key, sec)
auth.set_access_token(tok, tsec)
api = tweepy.API(auth)
pub = api.home_timeline()
for i in pub:
        print str(i)

In general, you can use the dir() builtin in Python to inspect an object.
It would seem the Tweepy documentation is very lacking here, but I would imagine the Status objects mirror the structure of Twitter's REST status format, see (for example) https://dev.twitter.com/docs/api/1/get/statuses/home_timeline
So -- try
print dir(status)
to see what lives in the status object
or just, say,
print status.text
print status.user.screen_name

Have a look at the __getstate__() method, which can be used to inspect the returned object:
for i in pub:
    print i.__getstate__()

The api.home_timeline() method returns a list of 20 tweepy.models.Status objects which correspond to the top 20 tweets. That is, each Tweet is considered as an object of Status class. Each Status object has a number of attributes like id, text, user, place, created_at, etc.
The following code would print the tweet id and the text :
tweets = api.home_timeline()
for tweet in tweets:
    print tweet.id, " : ", tweet.text

If you want a specific tweet, you need its tweet id. Then use:
tweets = self.api.statuses_lookup(tweetIDs)
for tweet in tweets:
    # tweet obtained (a Status object, so use attribute access)
    print(str(tweet.id)+str(tweet.text))
Or, if you want tweets in general, use the Twitter streaming API:
import json
import pymongo
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener
class StdOutListener(StreamListener):
    def __init__(self, outputDatabaseName, collectionName):
        try:
            print("Connecting to database")
            conn=pymongo.MongoClient()
            outputDB = conn[outputDatabaseName]
            self.collection = outputDB[collectionName]
            self.counter = 0
        except pymongo.errors.ConnectionFailure as e:
            print ("Could not connect to MongoDB:")
    def on_data(self,data):
        datajson=json.loads(data)
        if "lang" in datajson and datajson["lang"] == "en" and "text" in datajson:
            self.collection.insert(datajson)
            text=datajson["text"].encode("utf-8") #The text of the tweet
            self.counter += 1
            print(str(self.counter) + " " +str(text))
    def on_error(self, status):
        print("ERROR")
        print(status)
    def on_connect(self):
        print("You're connected to the streaming server.")
l=StdOutListener(dbname,cname)
auth=OAuthHandler(Auth.consumer_key,Auth.consumer_secret)
auth.set_access_token(Auth.access_token,Auth.access_token_secret)
stream=Stream(auth,l)
stream.filter(track=stopWords)
Create a class StdOutListener that inherits from StreamListener.
Override the on_data function. Each tweet arrives in JSON format, and this function runs every time a tweet is received.
Tweets are filtered according to stopWords, which is the list of words you want in your tweets.

On a tweepy Status instance you can access the _json attribute, which returns a dict representing the original Tweet contents.
For example:
type(status)
# tweepy.models.Status
type(status._json)
# dict
status._json.keys()
# dict_keys(['favorite_count', 'contributors', 'id', 'user', ...])
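Building on the _json attribute, a small helper can turn a status into a readable line. This is only a sketch: summarize is a hypothetical function, and fake_status is a stand-in object with invented data so the example runs without calling the Twitter API.

```python
from types import SimpleNamespace

def summarize(status):
    """Return 'screen_name: text' from a status object's _json dict."""
    data = status._json
    user = data.get("user") or {}
    return "{}: {}".format(user.get("screen_name", "?"), data.get("text", ""))

# Stand-in for a tweepy Status, for illustration only
fake_status = SimpleNamespace(
    _json={"id": 1, "text": "hello", "user": {"screen_name": "example"}}
)
print(summarize(fake_status))  # example: hello
```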

Fetching language detection from Google api

I have a CSV with keywords in one column and the number of impressions in a second column.
I'd like to pass each keyword in a URL (while looping) and have the Google language API return the language the keyword is in.
I have it working manually. If I enter (with the correct api key):
http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=myapikey&q=merde
I get:
{"responseData": {"language":"fr","isReliable":false,"confidence":6.213709E-4}, "responseDetails": null, "responseStatus": 200}
which is correct, 'merde' is French.
So far I have this code, but I keep getting server unreachable errors:
import time
import csv
from operator import itemgetter
import sys
import fileinput
import urllib2
import json
E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2
#not working
def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    # Deserialize JSON to Python objects
    result_object = json.loads(result)
    #Get the rows in the table, then get the second column's value
    # for each row
    return row in result_object
#not working
def retrieve_terms(seedterm):
    print(seedterm)
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=myapikey&q=%(seed)s'
    url = url_template % {"seed": seedterm}
    try:
        with urllib2.urlopen(url) as data:
            data = perform_request(seedterm)
            result = data.read()
    except:
        sys.stderr.write('%s\n' % 'Could not request data from server')
        exit(E_OPERATION_ERROR)
    #terms = parse_result(result)
    #print terms
    print result
def main(argv):
    filename = argv[1]
    csvfile = open(filename, 'r')
    csvreader = csv.DictReader(csvfile)
    rows = []
    for row in csvreader:
        rows.append(row)
    sortedrows = sorted(rows, key=itemgetter('impressions'), reverse = True)
    keys = sortedrows[0].keys()
    for item in sortedrows:
        retrieve_terms(item['keywords'])
    try:
        outputfile = open('Output_%s.csv' % (filename),'w')
    except IOError:
        print("The file is active in another program - close it first!")
        sys.exit()
    dict_writer = csv.DictWriter(outputfile, keys, lineterminator='\n')
    dict_writer.writer.writerow(keys)
    dict_writer.writerows(sortedrows)
    outputfile.close()
    print("File is Done!! Check your folder")
if __name__ == '__main__':
    start_time = time.clock()
    main(sys.argv)
    print("\n")
    print time.clock() - start_time, "seconds for script time"
Any idea how to finish the code so that it will work? Thank you!

Try to add referrer, userip as described in the docs:
An area to pay special attention to relates to correctly identifying yourself in your requests. Applications MUST always include a valid and accurate http referer header in their requests. In addition, we ask, but do not require, that each request contains a valid API Key. By providing a key, your application provides us with a secondary identification mechanism that is useful should we need to contact you in order to correct any problems. Read more about the usefulness of having an API key
Developers are also encouraged to make use of the userip parameter (see below) to supply the IP address of the end-user on whose behalf you are making the API request. Doing so will help distinguish this legitimate server-side traffic from traffic which doesn't come from an end-user.
Here's an example based on the answer to the question "access to google with python":
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2
from pprint import pprint
api_key, userip = None, None
query = {'q' : 'матрёшка'}
referrer = "https://stackoverflow.com/q/4309599/4279"
if userip:
    query.update(userip=userip)
if api_key:
    query.update(key=api_key)
url = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s' %(
    urllib.urlencode(query))
request = urllib2.Request(url, headers=dict(Referer=referrer))
json_data = json.load(urllib2.urlopen(request))
pprint(json_data['responseData'])
Output
{u'confidence': 0.070496580000000003, u'isReliable': False, u'language': u'ru'}
Another issue might be that seedterm is not properly quoted:
if isinstance(seedterm, unicode):
    value = seedterm
else: # bytes
    value = seedterm.decode(put_encoding_here)
url = 'http://...q=%s' % urllib.quote_plus(value.encode('utf-8'))
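For readers on Python 3, the same two ideas (percent-encoding the query and sending a Referer header) can be sketched with urllib.parse and urllib.request. build_request is a hypothetical helper; the sketch only constructs the request object and does not send it, so it makes no claim about what the service returns.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

BASE_URL = "http://ajax.googleapis.com/ajax/services/language/detect"

def build_request(term, referrer, api_key=None, userip=None):
    """Build a request with an encoded query string and a Referer header."""
    query = {"v": "1.0", "q": term}
    if api_key:
        query["key"] = api_key
    if userip:
        query["userip"] = userip
    # urlencode percent-encodes non-ASCII text as UTF-8
    url = BASE_URL + "?" + urlencode(query)
    return Request(url, headers={"Referer": referrer})

request = build_request("матрёшка", "https://stackoverflow.com/q/4309599/4279")
print(request.full_url)
print(request.get_header("Referer"))
```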
