Downloading all tweets about a certain subject in Python

I'm doing Twitter sentiment research at the moment, so I'm using the Twitter API to download all tweets on certain keywords. My current code takes a long time to build a large data file, so I was wondering if there's a faster method.
This is what I'm using right now:
__author__ = 'gerbuiker'
import time
# Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
# Variables that contain the user credentials to access the Twitter API
access_token = "XXXXXXXXXXXXX"
access_token_secret = "XXXXXXXX"
consumer_key = "XXXXX"
consumer_secret = "XXXXXXXXXXXXXX"
# This is a basic listener that prints received tweets to stdout and appends them to a file.
class StdOutListener(StreamListener):
    def on_data(self, data):
        try:
            #print data
            tweet = data.split(',"text":"')[1].split('","source')[0]
            print tweet
            saveThis = str(time.time()) + '::' + tweet  # saves time + actual tweet
            saveFile = open('twitiamsterdam.txt', 'a')
            saveFile.write(saveThis)
            saveFile.write('\n')
            saveFile.close()
            return True
        except BaseException, e:
            print 'failed ondata,', str(e)
            time.sleep(5)
    def on_error(self, status):
        print status
if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    # This line filters Twitter streams to capture data by keyword, e.g. 'Amsterdam'
    stream.filter(track=['KEYWORD which i want to check'])
This gets me about 1500 tweets in one hour for a pretty popular keyword (Amsterdam). Does anyone know a faster method in Python?
To be clear: I want to download all tweets on a certain subject for, say, the last month or year. New tweets don't have to keep coming in; the most recent ones for a given period would be sufficient. Thanks!
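A side note on the listener above, independent of speed: it pulls the tweet text out by splitting the raw payload on `,"text":"`, which can break on escaped quotes or a different field order. A minimal sketch, assuming the payload is the JSON string delivered by the stream, parses it with the standard `json` module instead. The function name `format_tweet_line` and the sample payload are made up for illustration.

```python
import json
import time


def format_tweet_line(raw, now=None):
    """Return 'timestamp::text' for a raw JSON payload, or None if it has no text."""
    message = json.loads(raw)
    text = message.get("text")
    if text is None:
        # Not a tweet (for example a limit or delete notice), so skip it.
        return None
    stamp = time.time() if now is None else now
    # Collapse newlines so each tweet stays on one line of the output file.
    return "%s::%s" % (stamp, " ".join(text.split()))


# Hypothetical payload, shortened to the fields used here.
sample = '{"created_at": "x", "text": "Hello\\nAmsterdam", "source": "web"}'
print(format_tweet_line(sample, now=0))
```

The returned line can be written to the output file as before; messages without a `text` field are simply skipped.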

I need something similar for academic research.
Were you able to fix it?
Would it be possible to specify a custom time range from which to pull the data?
Sorry for asking here, but I couldn't send you a private message.

Related

How to get twitter data of tweets within a certain time frame?

What do I put in my code to force the program to stop printing data once the tweets date back to a certain point? For example, how can I get all tweets about Verratti from within a month of running this?
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
# Credentials (placeholders)
access_token = "the code"
access_token_secret = "the code"
consumer_key = "the code"
consumer_secret = "the code"
# Print the text of each received tweet
class StdOutListener(StreamListener):
    def on_data(self, data):
        print(json.loads(data)['text'])
        return True
    def on_error(self, status):
        print(status)
# Find tweets matching the keyword
if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    # This line filters Twitter streams to capture data by the keyword 'Verratti'
    stream.filter(track=['Verratti'])
Nice question. It turns out that the Twitter API only lets you look back one week from the current date. There is a way around it, though: someone made a GitHub library that can search any time frame using Twitter's advanced search function, and you don't even have to bother with the whole authentication process.
Check it out: https://github.com/Jefferson-Henrique/GetOldTweets-python
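Whichever source the tweets come from, the cutoff itself can be checked against each tweet's `created_at` field. A minimal sketch, assuming `created_at` uses the classic layout such as `Wed Oct 10 20:19:24 +0000 2018`; the helper name `is_within_days` and the sample values are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Assumed layout of the created_at field, e.g. "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def is_within_days(created_at, days, now=None):
    """Return True if the tweet was created within `days` days before `now`."""
    created = datetime.strptime(created_at, CREATED_AT_FORMAT)
    if now is None:
        now = datetime.now(timezone.utc)
    return timedelta(0) <= now - created <= timedelta(days=days)


# Fixed reference time so the example is reproducible.
reference = datetime(2018, 10, 31, tzinfo=timezone.utc)
print(is_within_days("Wed Oct 10 20:19:24 +0000 2018", 30, now=reference))
print(is_within_days("Mon Sep 10 20:19:24 +0000 2018", 30, now=reference))
```

When iterating over results from newest to oldest, the loop can stop at the first tweet for which this returns False.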

Extract 1000 URIs from Twitter using Tweepy and Python

I am trying to extract 1000 unique, fully expanded URIs from Twitter using Tweepy and Python. Specifically, I am interested in links that lead outside of Twitter (so not back to other tweets, retweets, or duplicates).
The code I wrote keeps giving me a KeyError for "entities".
It gives me a few URLs before breaking; some are expanded, some are not. I have no idea how to go about fixing this.
Help me please!
Note: I left my credentials out.
Here is my code:
# Import the necessary methods from different libraries
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
# Variables that contain the user credentials to access the Twitter API
access_token = "enter token here"
access_token_secret = "enter token here"
consumer_key = "enter key here"
consumer_secret = "enter key here"
# Accessing tweepy API
# api = tweepy.API(auth)
# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        # resource: http://code.runnable.com/Us9rrMiTWf9bAAW3/how-to-stream-data-from-twitter-with-tweepy-for-python
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        # resource: http://socialmedia-class.org/twittertutorial.html
        # Print each tweet in the stream to the screen
        # Here we set it to stop after getting 1000 tweets.
        # You don't have to set it to stop, but can continue running
        # the Twitter API to collect data for days or even longer.
        count = 1000
        for url in decoded["entities"]["urls"]:
            count -= 1
            print "%s" % url["expanded_url"] + "\r\n\n"
            if count <= 0:
                break
    def on_error(self, status):
        print status
if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    # This line filters Twitter streams to capture data by the keyword: YouTube
    stream.filter(track=['YouTube'])
It seems like the stream is hitting a rate limit: when I catch the KeyError and print the keys of the offending message, I see [u'limit']. So one option is to wrap the loop in a try/except for KeyError. I also added a count display to verify that it does get to 1000:
count = 1000  # moved outside of class definition to avoid getting reset
class StdOutListener(StreamListener):
    def on_data(self, data):
        decoded = json.loads(data)
        global count  # get the count
        if count <= 0:
            import sys
            sys.exit()
        else:
            try:
                for url in decoded["entities"]["urls"]:
                    count -= 1
                    print count, ':', "%s" % url["expanded_url"] + "\r\n\n"
            except KeyError:
                print decoded.keys()
    def on_error(self, status):
        print status
if __name__ == '__main__':
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    stream.filter(track=['YouTube'])
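The question also asked for unique links that point outside Twitter. A minimal sketch of that filtering step, working on already decoded messages so it runs without a stream; the helper name `external_urls` and the sample messages are hypothetical, and treating only `twitter.com` hosts as internal is an assumption:

```python
from urllib.parse import urlparse


def external_urls(message, seen):
    """Return new, non-Twitter expanded URLs from one decoded stream message."""
    # Limit notices and other non-tweet messages have no "entities" key.
    entities = message.get("entities", {})
    found = []
    for url in entities.get("urls", []):
        expanded = url.get("expanded_url")
        if not expanded or expanded in seen:
            continue  # missing or duplicate link
        host = urlparse(expanded).netloc.lower()
        if host == "twitter.com" or host.endswith(".twitter.com"):
            continue  # links back into Twitter are not wanted
        seen.add(expanded)
        found.append(expanded)
    return found


seen = set()
messages = [
    {"limit": {"track": 12}},
    {"entities": {"urls": [{"expanded_url": "https://example.com/a"}]}},
    {"entities": {"urls": [{"expanded_url": "https://twitter.com/i/status/1"},
                           {"expanded_url": "https://example.com/a"}]}},
]
for message in messages:
    for link in external_urls(message, seen):
        print(link)
```

Inside `on_data`, the size of `seen` could replace the global counter, so that only unique external links count toward the 1000.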

Twitter streaming formatting JSON Output

Maybe you can help me. The following Python code retrieves Twitter streaming data and stops once 1000 tweets have been received. It works, but it returns the fields created_at, screen_name, and text separated by tabs. Instead, I'd like to get the data in JSON format. How can I change the code so that the output is formatted as JSON?
# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json
# Import the necessary methods from the "twitter" library
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream
# Variables that contain the user credentials to access the Twitter API (redacted)
CONSUMER_KEY = 'XXXXX'
CONSUMER_SECRET = 'XXXXX'
ACCESS_TOKEN = 'XXXXX'
ACCESS_SECRET = 'XXXXX'
oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
# Initiate the connection to the Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)
# Get a sample of the public data flowing through Twitter
#iterator = twitter_stream.statuses.sample()
iterator = twitter_stream.statuses.filter(track="Euro2016", language="fr")
tweet_count = 1000
for tweet in iterator:
    tweet_count -= 1
    print (tweet['created_at'],"\t",tweet['user']['screen_name'],"\t",tweet['geo'], "\t",tweet['text'])
    if tweet_count <= 0:
        break
You can import tweepy (you need to install it first with pip) and override the listener class to output the data in JSON format. Here is an example:
import json
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener
# Listener class override
class listener(StreamListener):
    def on_data(self, data):
        try:
            tweet = json.loads(data)
            with open('your_data.json', 'a') as my_file:
                json.dump(tweet, my_file)
                my_file.write('\n')  # one JSON object per line
        except BaseException:
            print('Error')
    def on_error(self, status):
        print(status)
# tweepy needs its own auth handler; the OAuth object from the "twitter" library will not work here
auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
my_listener = listener()
twitterStream = Stream(auth, my_listener)  # Initialize Stream object
You can read more about tweepy here: http://docs.tweepy.org/en/v3.4.0/streaming_how_to.html
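If only the fields from the question are needed, the same idea works without switching libraries: build a small dict per tweet and serialize it with `json.dumps`. A minimal sketch, assuming each tweet is already a decoded dict; the helper name `tweet_to_json_line` and the sample tweet are made up for illustration:

```python
import io
import json


def tweet_to_json_line(tweet):
    """Serialize the fields of interest from a decoded tweet as one JSON line."""
    record = {
        "created_at": tweet.get("created_at"),
        "screen_name": tweet.get("user", {}).get("screen_name"),
        "geo": tweet.get("geo"),
        "text": tweet.get("text"),
    }
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(record, ensure_ascii=False)


# Hypothetical tweet, reduced to the fields used above.
sample_tweet = {
    "created_at": "Fri Jun 10 19:00:00 +0000 2016",
    "user": {"screen_name": "example_user"},
    "geo": None,
    "text": "Allez les Bleus ! #Euro2016",
}
buffer = io.StringIO()  # stands in for a file opened in append mode
buffer.write(tweet_to_json_line(sample_tweet) + "\n")
print(buffer.getvalue(), end="")
```

Writing one object per line keeps the file easy to read back later, one `json.loads` call per line.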

Twitter Streaming API with Python Tweepy

I've been playing with the Twitter Streaming API using the Tweepy library. I started by following my own account and streaming my own tweets as I posted them, which worked fine.
I then attempted to stream tweets from a fairly large region ([30,-85,31,-84]), but initially seemed to receive no data. I then started receiving 'Location Deletion Notices', or 'scrub_geo' messages, and have only received those ever since. I changed my code back to the previously working follow code, but I continue to receive 'scrub_geo' messages and no statuses from my profile.
Here's the script I'm using:
# Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
# Other libs
import json
# Variables that contain the user credentials to access the Twitter API
access_token = "<my_access_token>"
access_token_secret = "<my_secret_token>"
consumer_key = "<my_consumer_key>"
consumer_secret = "<my_consumer_secret>"
# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        #try:
        #    json_data = json.loads(data)
        #    print json_data['created_at'] + " " + data['text']
        #except:
        print "Data " + str(data)
        return True
    def on_error(self, status):
        print "Error " + str(status)
        if status == 420:
            print("420 error.")
            return False
if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    # Start streaming with the right parameters
    #tallahassee=[30,-85,31,-84]
    #stream.filter(locations=tallahassee) <---- previously used
    stream.filter(follow="<my_user_id>")
Your coordinates are reversed. Since we're dealing with GeoJSON, always use (long, lat, alt), i.e. (x, y, z).
So you'll need to provide tallahassee=[-85,30,-84,31]. Always provide longitude first, the same way you would write (x, y) in math.
There are some places, like Google Maps, that put latitude first. You just have to be careful about which format you're dealing with.
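To avoid mixing up the order by hand, the box can be built from named arguments. A minimal sketch; the helper name `bounding_box` is hypothetical, and it does not handle boxes that cross the antimeridian:

```python
def bounding_box(south, west, north, east):
    """Return [west, south, east, north]: longitude first, south-west corner first."""
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    if not (-180 <= west <= east <= 180):
        raise ValueError("longitudes must satisfy -180 <= west <= east <= 180")
    return [west, south, east, north]


tallahassee = bounding_box(south=30, west=-85, north=31, east=-84)
print(tallahassee)
```

The result can be passed as `locations=tallahassee`. The range checks only catch values that cannot be a latitude or longitude at all, so a swapped pair that happens to fall within range still slips through.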

401 Error when retrieving Twitter data using Tweepy

I am trying to retrieve Twitter data with Tweepy using the code below, but I get a 401 error. I regenerated the access and secret tokens, and the same error appeared.
# imports
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
# setting up the keys
consumer_key = 'xxxxxxx'
consumer_secret = 'xxxxxxxx'
access_token = 'xxxxxxxxxx'
access_secret = 'xxxxxxxxxxxxx'
class TweetListener(StreamListener):
    # A listener handles tweets as they are received from the stream.
    # This is a basic listener that just prints received tweets to standard output.
    def on_data(self, data):
        print (data)
        return True
    def on_error(self, status):
        print (status)
# printing all the tweets to the standard output
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
stream = Stream(auth, TweetListener())
t = u"#سوريا"
stream.filter(track=[t])
Just reset your system's clock.
If an authenticated API request comes from a machine whose clock is more than 15 minutes off from Twitter's time, it will fail with a 401 error.
Thank you.
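One way to check for this is to compare the local clock with the `Date` header of any HTTP response. A minimal sketch using only the standard library; the helper name `clock_skew_seconds` and the sample values are hypothetical, and the 15-minute tolerance is taken from the answer above:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_SKEW_SECONDS = 15 * 60  # the 15-minute tolerance mentioned above


def clock_skew_seconds(date_header, local_now=None):
    """Return local time minus server time, in seconds, from an HTTP Date header."""
    server_time = parsedate_to_datetime(date_header)
    if local_now is None:
        local_now = datetime.now(timezone.utc)
    return (local_now - server_time).total_seconds()


# Hypothetical values: a Date header from a response and a local clock 20 minutes ahead.
header = "Tue, 15 Nov 1994 08:12:31 GMT"
local = datetime(1994, 11, 15, 8, 32, 31, tzinfo=timezone.utc)
skew = clock_skew_seconds(header, local)
print(skew, abs(skew) > MAX_SKEW_SECONDS)
```

If the absolute skew exceeds the tolerance, synchronizing the system clock is the first thing to try.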
You might just have made a mistake in copying the access token from the apps.twitter.com page.
You need to copy the entire string given as the access token, not just the part after the hyphen.
For example, copy and paste the entire string, like 74376347-jkghdui456hjkbjhgbm45gj, not just jkghdui456hjkbjhgbm45gj.
(Note that the string above is just something I typed randomly for demonstration purposes. Your actual access token will look similar, though, i.e.
"a string of digits-an alphanumeric string".)
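A quick sanity check for a truncated paste could look like the sketch below. The pattern simply encodes the "digits, hyphen, alphanumerics" shape described above; it is an assumption for illustration, not an official token format, and the helper name is made up:

```python
import re

# Assumed shape: digits, a hyphen, then letters and digits.
TOKEN_SHAPE = re.compile(r"^\d+-[A-Za-z0-9]+$")


def looks_like_full_access_token(token):
    """Return True if the token has the 'digits-alphanumerics' shape."""
    return bool(TOKEN_SHAPE.match(token.strip()))


print(looks_like_full_access_token("74376347-jkghdui456hjkbjhgbm45gj"))
print(looks_like_full_access_token("jkghdui456hjkbjhgbm45gj"))
```

A token that fails this check is worth copying again from the app page before looking for other causes.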
You just have to put your keys in quotes,
and you don't have to define your keys again in the final Twitter authentication step.
# Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
# Variables that contain the user credentials to access the Twitter API
access_token = 'X3YIzD'
access_token_secret = 'PiwPirr'
consumer_key = 'ekaOmyGn'
consumer_secret = 'RkFXRIOf83r'
# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        print data
        return True
    def on_error(self, status):
        print status
if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    # This line filters Twitter streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])
I had the same issue, and nothing here fixed it. The trick for me was that streaming tweets with Tweepy apparently requires OAuth 1a authentication, not OAuth 2 (see https://github.com/tweepy/tweepy/issues/1346). This means you need to set an access token as well as the consumer key and secret on the authentication object.
import tweepy
# user credentials
access_token = '...'
access_token_secret = '...'
consumer_key = '...'
consumer_secret = '...'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# this is the main difference
auth.set_access_token(access_token, access_token_secret)
# pass a listener instance, not the class itself
stream = tweepy.Stream(auth, tweepy.StreamListener())
In my case the error occurred because I was using AppAuthHandler rather than OAuthHandler. Switching to OAuthHandler resolved the issue.
In my case, I had this problem but it did not have to do with time.
My app had a "read only" permission.
I had to change it to a "read and write" permission for the error to cease.
You can do this by going to "user authentication" in the app settings page.
After changing your read only permission, you have to regenerate your access token, then put it into your code. Thanks for the help!
