PROBLEM SOLVED, SEE SOLUTION AT THE END OF THE POST
I need help estimating the running time of my tweepy program, which calls the Twitter Streaming API with a location filter.
After I kicked it off, it ran for over 20 minutes, which is longer than I expected. I am new to the Twitter Streaming API and have only worked with the REST API for a couple of days. It looks to me like the REST API will give me 50 tweets in a few seconds, easily, but this stream request is taking a lot more time. My program hasn't died on me or given any error, so I don't know if there's anything wrong with it. If there is, please point it out.
In conclusion, if you think my code is correct, could you provide an estimate for the running time? If you think my code is wrong, could you help me to fix it?
Thank you in advance!
Here's the code:
# Import Tweepy, sys, sleep, credentials.py
import tweepy, sys
from time import sleep
from credentials import *
# Access and authorize our Twitter credentials from credentials.py
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
box = [-86.33,41.63,-86.20,41.74]
class CustomStreamListener(tweepy.StreamListener):
    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream
    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream
stream = tweepy.streaming.Stream(auth, CustomStreamListener()).filter(locations=box).items(50)
stream
I tried the method from http://docs.tweepy.org/en/v3.4.0/auth_tutorial.html#auth-tutorial but it is not working for me. Here is my code. Would you mind giving any input? Let me know if you have some working code. Thanks!
# Import Tweepy, sys, sleep, credentials.py
import tweepy, sys
from time import sleep
from credentials import *
# Access and authorize our Twitter credentials from credentials.py
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
# Assign coordinates to the variable
box = [-74.0,40.73,-73.0,41.73]
import tweepy
#override tweepy.StreamListener to add logic to on_status
class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)
    def on_error(self, status_code):
        if status_code == 420:
            #returning False in on_data disconnects the stream
            return False
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener())
myStream.filter(track=['python'], locations=(box), async=True)
Here is the error message:
Traceback (most recent call last):
File "test.py", line 26, in <module>
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener())
TypeError: 'MyStreamListener' object is not callable
PROBLEM SOLVED! SEE SOLUTION BELOW
After another round of debugging, here is the solution for anyone interested in the same topic:
# Import Tweepy, sys, sleep, credentials.py
try:
    import json
except ImportError:
    import simplejson as json
import tweepy, sys
from time import sleep
from credentials import *
# Access and authorize our Twitter credentials from credentials.py
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
# Assign coordinates to the variable
box = [-74.0,40.73,-73.0,41.73]
import tweepy
#override tweepy.StreamListener to add logic to on_status
class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text.encode('utf-8'))
    def on_error(self, status_code):
        if status_code == 420:
            #returning False in on_data disconnects the stream
            return False
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(api.auth, listener=myStreamListener)
myStream.filter(track=['NYC'], locations=(box), async=True)
Core Problem:
I think you're misunderstanding what the Stream is here.
Tl;dr: Your code is working, you're just not doing anything with the data that comes back.
The REST API call is a single call for information. You make a request, and Twitter sends back some information, which gets assigned to your variable.
The Stream object (which you've created as stream) from Tweepy opens a connection to Twitter with your search parameters, and Twitter, well, streams tweets to it. Forever.
From the Tweepy docs:
The streaming api is quite different from the REST api because the
REST api is used to pull data from twitter but the streaming api
pushes messages to a persistent session. This allows the streaming api
to download more data in real time than could be done using the REST
API.
So, you need to build a handler (a StreamListener, in tweepy's terminology), for example one that prints out the tweets.
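A minimal sketch of such a handler, assuming tweepy 3.x (the version used in this thread). The class name LimitedListener, its limit parameter, and the FakeStatus helper are made up for illustration. The fallback base class is only there so the logic can be tried without tweepy 3.x installed. It prints each tweet and stops after a fixed number, which is what the original .items(50) seemed to be aiming for.

```python
try:
    # tweepy 3.x, the version used in this thread
    from tweepy import StreamListener
except ImportError:
    # Stand-in base class (an assumption of this sketch) so the listener
    # logic can be exercised without tweepy 3.x installed.
    StreamListener = object


class LimitedListener(StreamListener):
    """Print each tweet and ask the stream to stop after `limit` tweets."""

    def __init__(self, limit=50):
        super(LimitedListener, self).__init__()
        self.limit = limit
        self.texts = []

    def on_status(self, status):
        self.texts.append(status.text)
        print(status.text)
        # In tweepy 3.x, returning False from on_status disconnects the stream.
        return len(self.texts) < self.limit

    def on_error(self, status_code):
        if status_code == 420:
            # Rate limited: disconnect instead of retrying.
            return False


class FakeStatus(object):
    """Hypothetical stand-in for a tweepy Status, used only for this demo."""

    def __init__(self, text):
        self.text = text


listener = LimitedListener(limit=2)
keep_going = [listener.on_status(FakeStatus(t)) for t in ("first", "second")]
```

With a real connection, the listener would be passed to the stream as in the question, along the lines of tweepy.Stream(auth, LimitedListener(50)).filter(locations=box). How long it takes to reach 50 tweets then depends entirely on how often people tweet from inside the bounding box.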
Additional
Word of warning, from bitter experience - if you're going to try and save the tweets to a database: Twitter can, and will, stream objects to you much faster than you can save them to the database. This will result in your Stream being disconnected, because the tweets back up at Twitter, and over a certain level of backed-up-ness (not an actual phrase), they'll just disconnect you.
I handled this by using django-rq to put save jobs into a job queue. This way, I could handle hundreds of tweets a second (at peak), and it would smooth out. You can see how I did this below. Python-rq would also work if you're not using Django as a framework around it. The read_both method is just a function that reads from the tweet and saves it to a Postgres database. In my specific case, I did that via the Django ORM, using the django_rq.enqueue function.
__author__ = 'iamwithnail'
from django.core.management.base import BaseCommand, CommandError
from django.db.utils import DataError
from harvester.tools import read_both
import django_rq
class Command(BaseCommand):
    args = '<search_string search_string>'
    help = "Opens a listener to the Twitter stream, and tracks the given string or list " \
           "of strings, saving them down to the DB as they are received."
    def handle(self, *args, **options):
        try:
            import urllib3.contrib.pyopenssl
            urllib3.contrib.pyopenssl.inject_into_urllib3()
        except ImportError:
            pass
        consumer_key = '***'
        consumer_secret = '****'
        access_token='****'
        access_token_secret_var='****'
        import tweepy
        import json
        # This is the listener, responsible for receiving data
        class StdOutListener(tweepy.StreamListener):
            def on_data(self, data):
                decoded = json.loads(data)
                try:
                    # Only queue English tweets; the save itself runs in an rq worker
                    if decoded['lang'] == 'en':
                        django_rq.enqueue(read_both, decoded)
                    else:
                        pass
                except KeyError,e:
                    print "Error on Key", e
                except DataError, e:
                    print "DataError", e
                return True
            def on_error(self, status):
                print status
        l = StdOutListener()
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret_var)
        stream = tweepy.Stream(auth, l)
        stream.filter(track=args)
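The same buffering idea can be sketched without Django or Redis. This is not the rq setup above, just a rough stand-in using the standard library's queue and a worker thread, written for Python 3. The names save_tweet, worker, and the saved list are hypothetical; save_tweet is where a real database write would go.

```python
import json
import queue
import threading

tweet_queue = queue.Queue()
saved = []  # stands in for the database in this sketch


def save_tweet(tweet):
    """Hypothetical slow save; replace with a real database write."""
    saved.append(tweet["text"])


def worker():
    """Drain the queue in the background until the None sentinel arrives."""
    while True:
        tweet = tweet_queue.get()
        if tweet is None:
            break
        save_tweet(tweet)


def on_data(raw):
    """What a listener's on_data would do: parse, filter, enqueue, return fast."""
    decoded = json.loads(raw)
    if decoded.get("lang") == "en":
        tweet_queue.put(decoded)
    return True


thread = threading.Thread(target=worker)
thread.daemon = True
thread.start()

on_data('{"lang": "en", "text": "hello"}')
on_data('{"lang": "fr", "text": "bonjour"}')
tweet_queue.put(None)  # tell the worker to stop
thread.join()
```

The point is only that on_data returns quickly while the slow save happens elsewhere. Unlike rq, an in-process queue like this is lost if the process dies and can grow without bound if saving never catches up.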
Edit: Your subsequent problem is caused by calling the listener wrongly.
myStreamListener = MyStreamListener() #creates an instance of your class
Where you have this:
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener())
You're trying to call the listener as a function when you use the (). So it should be:
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)
And in fact, can probably just be more succinctly written as:
myStream = tweepy.Stream(api.auth,myStreamListener)
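The difference between calling the class and calling the instance can be reproduced without tweepy at all. In this small sketch, Listener is a hypothetical stand-in for a StreamListener subclass.

```python
class Listener(object):
    """Hypothetical stand-in for a StreamListener subclass."""
    pass


my_listener = Listener()  # calling the class creates an instance

try:
    my_listener()  # calling the instance is what raises the TypeError
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

So the parentheses belong on the class name once, when the instance is created, and the instance is then passed around without them.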
I've been running the script below using tweepy, but on_direct_message() is never called. I'd like to use this function so I can receive new direct messages. I've used tweepy for the past month without any issue until now. There seem to be others out there with a similar issue: Tweepy streaming: on_direct_message() is never called
I'm on Mac OS X 10.10.5 and I'm using Python 2.7.
Any help would be really appreciated.
class MyStreamListener(tweepy.StreamListener):
    def on_direct_message(self, status):
        print "status: "
        print status
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener = MyStreamListener(), timeout = None, retry_count = None)
myStream.filter(track=["filter"], async=False)
Hi !
I am having an issue with the tweepy library for Python. The first time I launched the script below, everything worked perfectly. The second time, the script stopped unexpectedly.
I did not find anything about this behavior. The listener stops after a few seconds, and I do not get any error code or message.
Here is the code:
import tweepy
import sys
import json
from textwrap import TextWrapper
from datetime import datetime
from elasticsearch import Elasticsearch
consumer_key = "hidden"
consumer_secret = "hidden"
access_token = "hidden"
access_token_secret = "hidden"
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
ES_HOST = {"host" : "localhost", "port" : 9200}
es = Elasticsearch(hosts = [ES_HOST])
class StreamListener(tweepy.StreamListener):
    print('Starting StreamListener')
    status_wrapper = TextWrapper(width=60, initial_indent=' ', subsequent_indent=' ')
    def on_status(self, status):
        try:
            print '\n%s %s' % (status.author.screen_name, status.created_at)
            json_data = status._json
            #print json_data['text']
            es.create(index="idx_twp",
                      doc_type="twitter_twp_nintendo",
                      body=json_data
                      )
        except Exception, e:
            print e
            pass
print('Starting Receiving')
streamer = tweepy.Stream(auth=auth, listener=StreamListener(), timeout=3000000000)
#Fill with your own keywords below
terms = ['nintendo']
streamer.filter(None,terms)
#streamer.userstream(None)
print ('Ending program')
And here is the output (the script runs for only 2 seconds):
[root@localhost ~]# python projects/m/twitter/twitter_logs.py
Starting StreamListener
Starting Receiving
Ending program
I am using Python 2.7.5
Any ideas?
Hi !
I solved this weird issue by changing my Python version to 3.5 via virtualenv. For now, it works well.
This may have been due to the Python version. If anyone else runs into this, I recommend using virtualenv to test another Python version and see what happens.
FYI: I have already opened issue #759 on the GitHub project.
I'm trying to use tweepy to build a dataset of tweets. Right now, I have the stream running for a single search term, but I would like to use the library to search for different queries at the same time. I know I can pass the twitterStream.filter function a list instead of just the "Disney" term, but I am not sure how I can then tell which term each returned tweet matched.
What would be a good extension of the following code to search for ["Disney", "Pandabears", "Polarbears"] instead of just "Disney" and know which query returned the hit?
I can think of two ways to do this in principle:
1: Search the resulting tweet for the search terms and tag them accordingly. However, this doesn't really solve the problem as a tweet might contain two of the search terms. Described here
2: Run as many of the streams as there are search terms. However, I'm not sure the API allows the same app to have multiple active streams at once?
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
ckey = "secret"
csecret="secret"
atoken="secret"
asecret="secret"
searchterm = "Disney"
class listener(StreamListener):
    def on_data(self, data):
        try:
            tweet = data.split(',"text":"')[1].split('","source')[0]
            saveThis = str(time.time())+"::%::"+tweet
            saveFile = open("tweets.csv", "a")
            saveFile.write(saveThis)
            saveFile.write("\n")
            saveFile.close()
            return True
        except BaseException, e:
            print "Failed on data", str(e)
            time.sleep(10)
            return True # Don't kill the stream
    def on_error(self, status):
        print status
        time.sleep(5)
        return True # Don't kill the stream
try:
    auth = OAuthHandler(ckey, csecret)
    auth.set_access_token(atoken, asecret)
    twitterStream = Stream(auth, listener())
    twitterStream.filter(track=[searchterm])
except Exception:
    print "Failed in auth or streaming"
Is there a "good" way to solve this problem?
I have chosen to go with option 1 and run a single stream with multiple search terms, checking each tweet for matches manually...
tweet = "I am a tweet"
terms = ["am","tweet"]
matches = []
for i, term in enumerate(terms):
    if( term.lower() in tweet.lower() ):
        matches.append(i)
matches
Out: [0, 1]
...and adding the resulting matches list to the object returned by the stream listener. Of course, this results in a larger stream, increasing the risk of being rate limited.
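A sketch of how that tagging step might look when applied to the raw JSON a listener receives. The function names and the matched_terms key are made up for illustration. It differs from the snippet above in two ways: it records the terms themselves rather than their indices, and it matches whole words, so "Disneyland" would not count as "Disney". It only looks at the text field, which is an assumption; a tweet may also match a tracked term through other fields.

```python
import json
import re

TERMS = ["Disney", "Pandabears", "Polarbears"]


def matched_terms(text, terms=TERMS):
    """Return the tracked terms that occur as whole words in text."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def tag_tweet(raw):
    """Parse one raw tweet and record which tracked terms it mentions."""
    tweet = json.loads(raw)
    tweet["matched_terms"] = matched_terms(tweet.get("text", ""))
    return tweet


tagged = tag_tweet('{"text": "Polarbears at Disney World"}')
print(tagged["matched_terms"])
```

A tweet that mentions two terms simply gets both in its list, so the overlap case from option 1 is kept as information rather than lost.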
I'm a newbie in programming, but I hope you can help me with my problem. I'm trying to analyse tweets using tweepy/python/stream.api and R (the statistics program).
Right now the stream listener is working, but I can't use the output...
This is the script I'm running:
import sys
import tweepy
consumer_key="..."
consumer_secret="..."
access_key = "..."
access_secret = "..."
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text
    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream
    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream
sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['...'])
As a result, I don't get the full tweets (only the first 50 characters), and I can't see the time when each one was tweeted. How can I fix this, and is it possible to somehow "print" the output into an Excel file?
Write the output to a .csv file, or use the xlwt package (xlrd only reads Excel files). As far as the 50 characters are concerned, I don't know. It looks like this has to do with the library.
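A rough sketch of the CSV route, using only the standard library's csv module, which produces a file Excel and R can both open. The helper name write_tweet_row and the sample values are made up. In a real listener, on_status would pass status.created_at and status.text, and the writer would wrap a file opened for appending rather than an in-memory buffer.

```python
import csv
import io


def write_tweet_row(writer, created_at, text):
    """Write one tweet as a CSV row: timestamp first, then the tweet text."""
    writer.writerow([created_at, text])


# An in-memory buffer keeps the sketch self-contained; a real script
# would open a file on disk instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["created_at", "text"])  # header row
write_tweet_row(writer, "2016-01-01 12:00:00", "Hello, world")

csv_output = buffer.getvalue()
print(csv_output)
```

The csv module takes care of quoting, so commas and quotes inside a tweet do not break the columns.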
Change your print status.text to make use of xlwt to write directly to a cell in an excel sheet. I've hacked about with it and it's OK, but your code tends to end up quite verbose.
http://pypi.python.org/pypi/xlwt