I'm using the Tweepy API to collect tweets containing specific keywords or hashtags through the standard Academic Research Developer Account, which allows me to collect 10,000,000 tweets per month. I'm trying to collect tweets from one full calendar date at a time using the full archive search. I've gotten a rate limit error (despite the wait_on_rate_limit flag being set to true) and now this request limit error. I'm not sure why this is happening or what I should change at this point.
consumer_key = '***'
consumer_secret = '***'
access_token = '***'
access_token_secret = '***'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
def get_tweets_withHashTags(query, startdate, enddate, count=300):
    tweets_hlist = []
    tweets_list = []
    qt = str(query)
    for page in tweepy.Cursor(api.search_full_archive, label='myLabel', query=qt,
                              fromDate=startdate + '0000', toDate=enddate + '0000',
                              maxResults=100).pages(100):
        count = len(page)
        print("Count of tweets in each page for " + str(qt) + " : " + str(count))
        for value in page:
            hashList = value._json["entities"]["hashtags"]
            flag = 0
            for tag in hashList:
                if qt.lower() in tag["text"].lower():
                    flag = 1
            if flag == 1:
                tweets_hlist.append(value._json)
            tweets_list.append(value._json)
    print("tweets_hash_" + query + ": " + str(len(tweets_hlist)))
    print("tweets_" + query + ": " + str(len(tweets_list)))
    with open("/Users/Victor/Documents/tweetCollection/data/" + startdate + "/" + "query1_hash_" + str(startdate) + "_" + str(enddate) + "_" + query + '.json', 'w') as outfile:
        json.dump(tweets_hlist, outfile, indent=2)
    with open("/Users/Victor/Documents/tweetCollection/data/" + startdate + "/" + "query1_Contains_" + str(startdate) + "_" + str(enddate) + "_" + query + '.json', 'w') as outfile:
        json.dump(tweets_list, outfile, indent=2)
    return len(tweets_list)

query = ["keyword1", "keyword2", "keyword3", etc. ]
for value in query:
    get_tweets_withHashTags(value, "20200422", "20200423")
At the time of writing this answer, Tweepy does not support version 2 of the Twitter API (although there is a Pull Request to add initial support). api.search_full_archive is actually hitting the v1.1 Premium Full Archive search, which has lower request and Tweet cap volume limits, which is why you are seeing an error about exceeding your package.
For the Full Archive / All Tweets endpoint in v2 of the API, you need a Project in the Academic Track, and an App in that Project. You'll need to point your code at the /2/tweets/search/all endpoint. If you are using Python, there is sample code in the TwitterDev repo. This uses the requests library, rather than using a full API library like Tweepy.
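As a rough illustration (not the TwitterDev sample itself), a minimal sketch of calling the /2/tweets/search/all endpoint directly with the requests library could look like the following. The bearer token, query, and time window below are placeholders, not values from your setup:

import requests

BEARER_TOKEN = "***"  # placeholder: the App bearer token from your Academic Research Project

def search_all(query, start_time, end_time, max_results=100):
    # sketch of a single request to the v2 full-archive search endpoint
    url = "https://api.twitter.com/2/tweets/search/all"
    headers = {"Authorization": "Bearer " + BEARER_TOKEN}
    params = {
        "query": query,              # e.g. "keyword1 has:hashtags"
        "start_time": start_time,    # e.g. "2020-04-22T00:00:00Z"
        "end_time": end_time,        # e.g. "2020-04-23T00:00:00Z"
        "max_results": max_results,  # this endpoint accepts up to 500 per request
    }
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    return response.json()

You would still need to page through results with the next_token the endpoint returns and respect its own rate limits.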
I'm using Tweepy 3.10.0 to collect tweets containing specific keywords and hashtags for a single calendar day at a time. I recently upgraded from the standard Developer Account to the Premium Account to access the full archive. I know this changes the search function to search_full_archive and a couple of other small pieces of syntax. I thought I made the correct changes, but I'm still getting this error. I've checked the Developer API reference.
consumer_key = '****'
consumer_secret = '****'
access_token = '****'
access_token_secret = '****'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
def get_tweets_withHashTags(query, startdate, enddate, count=300):
    tweets_hlist = []
    tweets_list = []
    qt = str(query)
    for page in tweepy.Cursor(api.search_full_archive, environment_name='FullArchive', q=qt,
                              fromDate=startdate, toDate=enddate, count=300,
                              tweet_mode='extended').pages(100):
        count = len(page)
        print("Count of tweets in each page for " + str(qt) + " : " + str(count))
        for value in page:
            hashList = value._json["entities"]["hashtags"]
            flag = 0
            for tag in hashList:
                if qt.lower() in tag["text"].lower():
                    flag = 1
            if flag == 1:
                tweets_hlist.append(value._json)
            tweets_list.append(value._json)
    print("tweets_hash_" + query + ": " + str(len(tweets_hlist)))
    print("tweets_" + query + ": " + str(len(tweets_list)))
    with open("/Users/Victor/Documents/tweetCollection/data/" + startdate + "/" + "query1_hash_" + str(startdate) + "_" + str(enddate) + "_" + query + '.json', 'w') as outfile:
        json.dump(tweets_hlist, outfile, indent=2)
    with open("/Users/Victor/Documents/tweetCollection/data/" + startdate + "/" + "query1_Contains_" + str(startdate) + "_" + str(enddate) + "_" + query + '.json', 'w') as outfile:
        json.dump(tweets_list, outfile, indent=2)
    return len(tweets_list)

query = ["KeyWord1", "KeyWord2", "KeyWord3", etc.]
for value in query:
    get_tweets_withHashTags(value, "2020-04-21", "2020-04-22")
According to the API's code (https://github.com/tweepy/tweepy/blob/5b2dd086c2c5a08c3bf7be54400adfd823d19ea1/tweepy/api.py#L1144), api.search_full_archive takes label (the environment name) and query as arguments. So change
api.search_full_archive, environment_name='FullArchive', q=qt, fromDate=startdate,toDate=enddate,count=300, tweet_mode='extended'
to
api.search_full_archive, label='FullArchive', query=qt, fromDate=startdate,toDate=enddate
As for tweet_mode='extended', it is not available for search_full_archive or search_30_day. You can see how to access the full text in https://github.com/tweepy/tweepy/issues/1461
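Based on the discussion in that issue, here is a rough sketch (with an assumed label and query, not your actual ones) of pulling the full text from the raw JSON; the premium payload carries an extended_tweet object for tweets longer than 140 characters:

for page in tweepy.Cursor(api.search_full_archive, label='FullArchive', query='example',
                          fromDate='202004210000', toDate='202004220000').pages(1):
    for status in page:
        data = status._json
        # premium search puts the long form of a tweet in "extended_tweet"
        if 'extended_tweet' in data:
            text = data['extended_tweet']['full_text']
        else:
            text = data['text']
        print(text)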
I am using tweepy to download tweets about a particular topic, but no matter which tutorial I follow I cannot get the tweet to output as a full tweet. There is always an ellipsis that cuts it off after a certain number of characters.
Here is the code I am using
import json
import tweepy
from tweepy import OAuthHandler
import csv
import sys
from twython import Twython
nonBmpMap = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
with open('Twitter_Credentials.json') as cred_data:
    info = json.load(cred_data)
consumer_Key = info['Consumer_Key']
consumer_Secret = info['Consumer_Secret']
access_Key = info['Access_Key']
access_Secret = info['Access_Secret']
maxTweets = int(input('Enter the Number of tweets that you want to extract '))
userTopic = input('What topic do you want to search for ')
topic = ('"' + userTopic + '"')
tweetCount = 0
auth = OAuthHandler(consumer_Key, consumer_Secret)
auth.set_access_token(access_Key, access_Secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
tweets = api.search(q=topic, count=maxTweets, tweet_mode= 'extended')
for tweet in tweets:
    tweetCount = tweetCount + 1
    with open('TweetsAbout' + userTopic, 'a', encoding='utf-8') as the_File:
        print(tweet.full_text.translate(nonBmpMap))
        tweet = (str(tweet.full_text).translate(nonBmpMap).replace(',', '').replace('|', '').replace('\n', '').replace('’', '\'').replace('…', "end"))
        the_File.write(tweet + "\n")
print('Extracted ' + str(tweetCount) + ' tweets about ' + topic)
Try this, see if it works!
try:
    specific_tweets = tweepy.Cursor(api.search, tweet_mode='extended', q=<your_query_string> + " -filter:retweets", lang='en').items(500)
except tweepy.error.TweepError:
    pass

for tweet in specific_tweets:
    extracted_text = tweet.full_text
All the text you're trying to extract should be in extracted_text. Good luck!
Here is my code. I want to extract tweets from Twitter with some keywords. My code doesn't give any errors, but the output file is not generated. Please help me.
import re
import csv
import tweepy
from tweepy import OAuthHandler
# TextBlob performs simple natural language processing tasks.
from textblob import TextBlob

def search():
    #text = e.get() **************************
    consumer_key = ''
    consumer_secret = ''
    access_token = ' '
    access_token_secret = ' '
    # create OAuthHandler object
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # set access token and secret
    auth.set_access_token(access_token, access_token_secret)
    # create tweepy API object to fetch tweets
    api = tweepy.API(auth)

    def get_tweets(query, count=300):
        # empty list to store parsed tweets
        tweets = []
        target = open("tweets.txt", 'w', encoding="utf-8")
        t1 = open("review.txt", 'w', encoding="utf-8")
        # call twitter api to fetch tweets
        q = str(query)
        a = str(q + " sarcasm")
        b = str(q + " sarcastic")
        c = str(q + " irony")
        fetched_tweets = api.search(a, count=count) + api.search(b, count=count) + api.search(c, count=count)
        # parsing tweets one by one
        print(len(fetched_tweets))
        for tweet in fetched_tweets:
            # empty dictionary to store required params of a tweet
            parsed_tweet = {}
            # saving text of tweet
            parsed_tweet['text'] = tweet.text
            if "http" not in tweet.text:
                line = re.sub("[^A-Za-z]", " ", tweet.text)
                target.write(line + "\n")
                t1.write(line + "\n")
        return tweets

    # creating object of TwitterClient Class
    # calling function to get tweets
    tweets = get_tweets(query=text, count=20000)

root.mainloop()
From this code I am not getting the output file generated. Can anyone tell me what I am doing wrong?
Thanks in advance!
I just made some slight changes and it worked perfectly for me. I removed or commented out some unnecessary statements (like the review file) and changed the open function to io.open since I have Python 2.7. Here is the running code, hope it helps!
import re
import io
import csv
import tweepy
from tweepy import OAuthHandler
# TextBlob performs simple natural language processing tasks.
# from textblob import TextBlob

consumer_key = 'sz6x0nvL0ls9wacR64MZu23z4'
consumer_secret = 'ofeGnzduikcHX6iaQMqBCIJ666m6nXAQACIAXMJaFhmC6rjRmT'
access_token = '854004678127910913-PUPfQYxIjpBWjXOgE25kys8kmDJdY0G'
access_token_secret = 'BC2TxbhKXkdkZ91DXofF7GX8p2JNfbpHqhshW1bwQkgxN'

# create OAuthHandler object
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# set access token and secret
auth.set_access_token(access_token, access_token_secret)
# create tweepy API object to fetch tweets
api = tweepy.API(auth)

def get_tweets(query, count=300):
    # empty list to store parsed tweets
    tweets = []
    target = io.open("mytweets.txt", 'w', encoding='utf-8')
    # call twitter api to fetch tweets
    q = str(query)
    a = str(q + " sarcasm")
    b = str(q + " sarcastic")
    c = str(q + " irony")
    fetched_tweets = api.search(a, count=count) + api.search(b, count=count) + api.search(c, count=count)
    # parsing tweets one by one
    print(len(fetched_tweets))
    for tweet in fetched_tweets:
        # empty dictionary to store required params of a tweet
        parsed_tweet = {}
        # saving text of tweet
        parsed_tweet['text'] = tweet.text
        if "http" not in tweet.text:
            line = re.sub("[^A-Za-z]", " ", tweet.text)
            target.write(line + "\n")
    return tweets

# calling function to get tweets
tweets = get_tweets(query="", count=20000)
I am wondering how I can automate my program to fetch tweets at the max rate of 180 requests per 15 minutes, which at the max count of 100 tweets per request totals 18,000 tweets. I am creating this program for an independent case study at school.
I would like my program to avoid being rate limited and terminated. What I would like it to do is constantly use the max number of requests per 15 minutes and be able to run for 24 hours without user interaction, retrieving all the tweets possible for analysis.
Here is my code. It gets tweets matching the query and puts them into a text file, but it eventually gets rate limited. Would really appreciate the help.
import logging
import time
import csv
import twython
import json

app_key = ""
app_secret = ""
oauth_token = ""
oauth_token_secret = ""

twitter = twython.Twython(app_key, app_secret, oauth_token, oauth_token_secret)

tweets = []
MAX_ATTEMPTS = 1000000
# Max number of tweets per 15 minutes
COUNT_OF_TWEETS_TO_BE_FETCHED = 18000

for i in range(0, MAX_ATTEMPTS):
    if COUNT_OF_TWEETS_TO_BE_FETCHED < len(tweets):
        break

    if 0 == i:
        results = twitter.search(q="$AAPL", count='100', lang='en')
    else:
        results = twitter.search(q="$AAPL", include_entities='true', max_id=next_max_id)

    for result in results['statuses']:
        print result
        with open('tweets.txt', 'a') as outfile:
            json.dump(result, outfile, sort_keys=True, indent=4)

    try:
        next_results_url_params = results['search_metadata']['next_results']
        next_max_id = next_results_url_params.split('max_id=')[1].split('&')[0]
    except:
        break
You should be using Twitter's Streaming API.
This will allow you to receive a near-realtime feed of your search. You can write those tweets to a file just as fast as they come in.
Using the track parameter you will be able to receive only the specific tweets you're interested in.
You'll need to use Twython Streamer - and your code will look something like this:
from twython import TwythonStreamer

class MyStreamer(TwythonStreamer):
    def on_success(self, data):
        if 'text' in data:
            print data['text'].encode('utf-8')

    def on_error(self, status_code, data):
        print status_code

stream = MyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
stream.statuses.filter(track='$AAPL')
Is it possible to get the full follower list of an account that has more than one million followers, like McDonald's?
I use Tweepy with the following code:
c = tweepy.Cursor(api.followers_ids, id = 'McDonalds')
ids = []
for page in c.pages():
    ids.append(page)
I also tried this:
for id in c.items():
    ids.append(id)
But I always got a 'Rate limit exceeded' error and there were only 5,000 follower ids.
To avoid the rate limit, you can/should wait before the next follower page request. It looks hacky, but it works:
import time
import tweepy

auth = tweepy.OAuthHandler(..., ...)
auth.set_access_token(..., ...)

api = tweepy.API(auth)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="McDonalds").pages():
    ids.extend(page)
    time.sleep(60)

print len(ids)
Hope that helps.
Use the rate limiting arguments when making the connection. The API will then self-regulate within the rate limit.
The sleep pause is not bad either; I use it to simulate a human and to spread activity over a time frame, with the API rate limiting as a final control.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)
Also add try/except to capture and control errors (a small sketch follows the links below).
Example code:
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/sample_rate_limit_w_cursor.py
I put my keys in an external file to make management easier.
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/keys.py
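As a rough, hedged sketch of the try/except idea (the function name and back-off value here are illustrative, not taken from the linked repo):

import time
import tweepy

def fetch_follower_ids(api, screen_name):
    ids = []
    try:
        # page through follower ids; wait_on_rate_limit on the api handles rate limits
        for page in tweepy.Cursor(api.followers_ids, screen_name=screen_name).pages():
            ids.extend(page)
    except tweepy.TweepError as err:
        # log the error and back off instead of crashing the whole run
        print("Tweepy error, backing off for 60 seconds:", err)
        time.sleep(60)
    return ids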
I use this code and it works for a large number of followers.
There are two functions: one saves the follower ids after every sleep period and another gets the list.
It is a little messy, but I hope it is useful.
import os
import time
import csv
import numpy as np
import tweepy

def save_followers_status(filename, foloowersid):
    path = '//content//drive//My Drive//Colab Notebooks//twitter//' + filename
    if not os.path.isfile(path + '_followers_status.csv'):
        # create the file on the first call
        with open(path + '_followers_status.csv', 'wb') as csvfile:
            filewriter = csv.writer(csvfile, delimiter=',')
    if len(foloowersid) > 0:
        print("save followers status of ", filename)
        file = path + '_followers_status.csv'
        # https://stackoverflow.com/questions/3348460/csv-file-written-with-python-has-blank-lines-between-each-row
        with open(file, mode='a', newline='') as csv_file:
            writer = csv.writer(csv_file, delimiter=',')
            for row in foloowersid:
                writer.writerow(np.array(row))

def get_followers_id(person):
    foloowersid = []
    count = 0
    influencer = api.get_user(screen_name=person)
    influencer_id = influencer.id
    number_of_followers = influencer.followers_count
    print("number of followers count : ", number_of_followers, '\n', 'user id : ', influencer_id)
    status = tweepy.Cursor(api.followers_ids, screen_name=person, tweet_mode="extended").items()
    for i in range(0, number_of_followers):
        try:
            user = next(status)
            foloowersid.append([user])
            count += 1
        except tweepy.TweepError:
            # rate limit hit: save what we have so far, sleep 15 minutes, then continue
            print('twitter rate limit reached, sleeping for 15 min')
            timestamp = time.strftime("%d.%m.%Y %H:%M:%S", time.localtime())
            print(timestamp)
            if len(foloowersid) > 0:
                print('fetched so far :', count, ', total followers count is : ', number_of_followers)
                foloowersid = np.array(str(foloowersid))
                save_followers_status(person, foloowersid)
                foloowersid = []
            time.sleep(15 * 60)
            next(status)
        except StopIteration:
            # no more followers to fetch: save the remainder and stop
            print('end of followers ', count, ', total followers count is : ', number_of_followers)
            foloowersid = np.array(str(foloowersid))
            save_followers_status(person, foloowersid)
            foloowersid = []
            break
    save_followers_status(person, foloowersid)
    # foloowersid = np.array(map(str, foloowersid))
    return foloowersid
The answer from alecxe is good, but no one has referred to the docs. The correct information and explanation to answer the question lives in the Twitter API documentation. From the documentation:
Results are given in groups of 5,000 user IDs and multiple “pages” of results can be navigated through using the next_cursor value in subsequent requests.
Tweepy's "get_follower_ids()" uses https://api.twitter.com/1.1/followers/ids.json endpoint. This endpoint has a rate limit (15 requests per 15 min).
You are getting the 'Rate limit exceeded' error, cause you are crossing that threshold.
Instead of manually putting the sleep in your code you can use wait_on_rate_limit=True when creating the Tweepy API object.
Moreover, the endpoint has an optional parameter count which specifies the number of users to return per page. The Twitter API documentation does not say anything about its default value. Its maximum value is 5000.
To get the most ids per request, explicitly set it to the maximum so that you need fewer requests.
Here is my code for getting all the followers' ids:
auth = tweepy.OAuth1UserHandler(consumer_key='', consumer_secret='',
                                access_token='', access_token_secret='')
api = tweepy.API(auth, wait_on_rate_limit=True)

account_id = 71026122  # instead of account_id you can also use screen_name
follower_ids = []
for page in tweepy.Cursor(api.get_follower_ids, user_id=account_id, count=5000).pages():
    follower_ids.extend(page)