I am trying to get all tweets from a specific user:
from twarc import Twarc2
import json

def get_all_tweets(user_id, DEBUG):
    # Your bearer token here
    t = Twarc2(bearer_token="blah")
    # Initialize a list to hold all the tweet pages
    alltweets = []
    new_tweets = {}
    if DEBUG:
        # Debug: read from file
        f = open('tweets_debug.txt')
        new_tweets = json.load(f)
        alltweets.extend(new_tweets)
    else:
        # Make initial request for most recent tweets (3200 is the maximum allowed count)
        new_tweets = t.timeline(user=user_id)
        # Save most recent tweets
        alltweets.extend(new_tweets)
    if DEBUG:
        # Debug: write to file
        f = open("tweets_debug.txt", "w")
        f.write(json.dumps(alltweets, indent=2, sort_keys=False))
        f.close()
    # Save the id of the oldest tweet less one
    oldest = str(int(alltweets[-1]['meta']['oldest_id']) - 1)
    # Keep grabbing tweets until there are no tweets left to grab
    while len(dict(new_tweets)) > 0:
        print(f"getting tweets before {oldest}")
        # All subsequent requests use the until_id param to prevent duplicates
        new_tweets = t.timeline(user=user_id, until_id=oldest)
        # Save most recent tweets
        alltweets.extend(new_tweets)
        # Update the id of the oldest tweet less one
        oldest = str(int(alltweets[-1]['meta']['oldest_id']) - 1)
        print(f"...{len(alltweets)} tweets downloaded so far")
    res = []
    for tweetlist in alltweets:
        res.extend(tweetlist['data'])
    f = open("output.txt", "w")
    f.write(json.dumps(res, indent=2, sort_keys=False))
    f.close()
    return res
However, len(dict(new_tweets)) does not work; it always returns 0. sum(1 for dummy in new_tweets) also returns 0.
I tried json.load(new_tweets) and that does not work either.
However, alltweets.extend(new_tweets) worked properly.
It seems like timeline() returns a generator (<generator object Twarc2._timeline at 0x000001D78B3D8B30>). Is there any way I can count its length, to determine whether there are any tweets left to grab?
Or, is there any way to merge...
someList = []
someList.extend(new_tweets)
while len(someList) > 0:
    # blah blah
...into one line with the while?
Edit: I tried print(list(new_tweets)) right before the while loop, and it returns []. It seems like the object is actually empty.
Is it because alltweets.extend(new_tweets) somehow consumes the new_tweets generator...?
I figured it out myself. The problem can be solved by converting the generator to a list:
new_tweets = list(t.timeline(user=user_id, until_id=oldest))
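The underlying reason: a generator can only be iterated once, so alltweets.extend(new_tweets) drains it, and anything that iterates it afterwards (len(dict(...)), sum(...), list(...)) sees nothing. A minimal sketch in plain Python, no Twitter API involved, showing the same behavior:

pages = (n for n in range(3))   # stand-in for t.timeline(...)
collected = []
collected.extend(pages)         # iterates the generator to its end, consuming it
print(list(pages))              # [] - nothing left, just like new_tweets above
print(len(collected))           # 3

# Materializing up front gives a reusable object with a real length:
pages = list(n for n in range(3))
collected = []
collected.extend(pages)
print(len(pages))               # still 3 - a list can be re-read and counted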
I am trying to retrieve about 1000 tweets for a search term like 'NFL' using tweepy, and to store the tweets in a DataFrame using pandas. My issue is that I can't find a way to remove duplicated tweets; I have tried df.drop_duplicates, but it only gives me about 100 tweets to work with. Help would be appreciated!
import tweepy
import pandas as pd

# api is assumed to be an authenticated tweepy.API instance
num_needed = 1000
tweet_list = []  # Lists to be added as columns (tweets, usernames, and screen names) in our dataframe
user_list = []
screen_name_list = []
last_id = -1  # ID of last tweet seen
while len(tweet_list) < num_needed:
    try:
        # Criteria for collecting the tweets; the results should be as accurate as possible for the final analysis
        new_tweets = api.search(q='NFL', count=num_needed, max_id=str(last_id - 1),
                                lang='en', tweet_mode='extended')
    except tweepy.TweepError as e:
        print("Error", e)
        break
    else:
        if not new_tweets:
            print("Could not find any more tweets!")
            break
        else:
            for tweet in new_tweets:
                # Fetch the screen name and username
                screen_name = tweet.author.screen_name
                user_name = tweet.author.name
                tweet_text = tweet.full_text
                tweet_list.append(tweet_text)
                user_list.append(user_name)
                screen_name_list.append(screen_name)

df = pd.DataFrame()  # Create a new dataframe (df) with new columns
df['Screen name'] = screen_name_list
df['Username'] = user_list
df['Tweets'] = tweet_list
Well, yes, when you use .drop_duplicates() you only get about 100 tweets, because that is how many unique tweets there are; the other 900 or so are duplicates of them, given how your code runs.
So you might be asking: why? api.search returns at most 100 tweets per request, which I am assuming you are aware of, since you are looping and trying to get more by using the max_id parameter. However, your last_id is always -1 here; you never read an id out of the results, and thus never change that parameter. So one thing you can do is collect the ids while you iterate through the tweets. Then, after you have all the ids, store the minimum id value as last_id, and it will work in your loop:
Code:
num_needed = 1000
tweet_list = []  # Lists to be added as columns (tweets, usernames, and screen names) in our dataframe
user_list = []
screen_name_list = []
tw_id = []  # <-- ADDED THIS
last_id = -1  # ID of last tweet seen
while len(tweet_list) < num_needed:
    try:
        new_tweets = api.search(q='NFL -filter:retweets', count=num_needed, max_id=str(last_id - 1),
                                lang='en', tweet_mode='extended')
    except tweepy.TweepError as e:
        print("Error", e)
        break
    else:
        if not new_tweets:
            print("Could not find any more tweets!")
            break
        else:
            for tweet in new_tweets:
                # Fetch the screen name and username
                screen_name = tweet.author.screen_name
                user_name = tweet.author.name
                tweet_text = tweet.full_text
                tweet_list.append(tweet_text)
                user_list.append(user_name)
                screen_name_list.append(screen_name)
                tw_id.append(tweet.id)  # <-- ADDED THIS
            last_id = min(tw_id)  # <-- ADDED THIS

df = pd.DataFrame({'Screen name': screen_name_list,
                   'Username': user_list,
                   'Tweets': tweet_list})
df = df.drop_duplicates()
This returns approximately 1000 tweets for me.
Output:
print(len(df))
1084
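As a sanity check on the explanation above, here is a tiny self-contained sketch (with made-up strings, not real tweets) of how drop_duplicates collapses repeated rows, which is why the original loop shrank to about 100:

import pandas as pd

df = pd.DataFrame({'Tweets': ['go team', 'go team', 'big win', 'big win', 'draft day']})
print(len(df))                    # 5 rows collected
print(len(df.drop_duplicates()))  # 3 unique rows survive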
I'm trying to extract tweets using the following code, and I just realized I'm only getting the first 140 characters. I'm a bit new at this; I now need to put tweet_mode=extended and full_text somewhere, so if someone could point out exactly where, I'd be very appreciative. Thank you!
#!/usr/bin/env python
# encoding: utf-8

import tweepy  # https://github.com/tweepy/tweepy
import csv

# Twitter API credentials
consumer_key = "xxx"
consumer_secret = "xxx"
access_key = "yyy"
access_secret = "yyy"

def get_all_tweets(screen_name):
    # Twitter only allows access to a user's most recent 3240 tweets with this method

    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print("getting tweets before %s" % oldest)

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        print("...%s tweets downloaded so far" % len(alltweets))

    # transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text,
                  tweet.retweet_count, tweet.favorite_count] for tweet in alltweets]

    # write the csv
    with open('%s_tweets.csv' % screen_name, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "full_text", "retweet_count", "favorite_count"])
        writer.writerows(outtweets)

if __name__ == '__main__':
    # pass in the username of the account you want to download
    get_all_tweets("realdonaldtrump")
Put "tweet_mode=extended" here:
new_tweets = api.user_timeline(screen_name = screen_name,
count=200,
tweet_mode=extended)
And here:
while len(new_tweets) > 0:
new_tweets = api.user_timeline(screen_name = screen_name,
count=200,
max_id=oldest,
tweet_mode=extended)
Put "full_tweet" here:
outtweets = [[tweet.id_str,
tweet.created_at,
tweet.full_tweet.encode("utf-8"),
tweet.retweet_count,
tweet.favorite_count] for tweet in alltweets]
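One extra caveat, not covered in the answer above: with tweet_mode="extended", retweets may still come back truncated, because their complete text lives on the embedded retweeted_status object. A hedged sketch of a small helper that handles both cases:

def get_full_text(tweet):
    # retweets carry the complete text on the embedded original tweet
    if hasattr(tweet, 'retweeted_status'):
        return tweet.retweeted_status.full_text
    return tweet.full_text

You could then use get_full_text(tweet) in place of tweet.full_text in the outtweets list.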
I am trying to stream live tweets with a given hashtag using the tweepy library. I am using the following code, taken from https://galeascience.wordpress.com/2016/03/18/collecting-twitter-data-with-python/
I am new to Python coding and APIs.
import tweepy
from tweepy import OAuthHandler
import json
import datetime as dt
import time
import os
import sys


def load_api():
    ''' Function that loads the twitter API after authorizing the user. '''
    consumer_key = 'xxx'
    consumer_secret = 'xxx'
    access_token = 'yyy'
    access_secret = 'yyy'
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # load the twitter API via tweepy
    return tweepy.API(auth)


def tweet_search(api, query, max_tweets, max_id, since_id, geocode):
    ''' Function that takes in a search string 'query', the maximum
        number of tweets 'max_tweets', and the minimum (i.e., starting)
        tweet id. It returns a list of tweepy.models.Status objects. '''
    searched_tweets = []
    while len(searched_tweets) < max_tweets:
        remaining_tweets = max_tweets - len(searched_tweets)
        try:
            new_tweets = api.search(q=query, count=remaining_tweets,
                                    since_id=str(since_id),
                                    max_id=str(max_id - 1))
            #                       geocode=geocode)
            print('found', len(new_tweets), 'tweets')
            if not new_tweets:
                print('no tweets found')
                break
            searched_tweets.extend(new_tweets)
            max_id = new_tweets[-1].id
        except tweepy.TweepError:
            print('exception raised, waiting 15 minutes')
            print('(until:', dt.datetime.now() + dt.timedelta(minutes=15), ')')
            time.sleep(15 * 60)
            break  # stop the loop
    return searched_tweets, max_id


def get_tweet_id(api, date='', days_ago=7, query='a'):
    ''' Function that gets the ID of a tweet. This ID can then be
        used as a 'starting point' from which to search. The query is
        required and has been set to a commonly used word by default.
        The variable 'days_ago' has been initialized to the maximum
        amount we are able to search back in time (9). '''
    if date:
        # return an ID from the start of the given day
        td = date + dt.timedelta(days=1)
        tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
        tweet = api.search(q=query, count=1, until=tweet_date)
    else:
        # return an ID from __ days ago
        td = dt.datetime.now() - dt.timedelta(days=days_ago)
        tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
        # get list of up to 10 tweets
        tweet = api.search(q=query, count=10, until=tweet_date)
    print('search limit (start/stop):', tweet[0].created_at)
    # return the id of the first tweet in the list
    return tweet[0].id


def write_tweets(tweets, filename):
    ''' Function that appends tweets to a file. '''
    with open(filename, 'a') as f:
        for tweet in tweets:
            json.dump(tweet._json, f)
            f.write('\n')


def main():
    ''' This is a script that continuously searches for tweets
        that were created over a given number of days. The search
        dates and search phrase can be changed below. '''

    ''' search variables: '''
    search_phrases = ['#Messi']
    time_limit = 1.5    # runtime limit in hours
    max_tweets = 200    # number of tweets per search (will be iterated over) - maximum is 100
    min_days_old, max_days_old = 1, 5  # search limits e.g., from 7 to 8
                                       # gives current weekday from last week,
                                       # min_days_old=0 will search from right now

    # loop over search items, creating a new file for each
    for search_phrase in search_phrases:
        print('Search phrase =', search_phrase)

        ''' other variables '''
        name = search_phrase.split()[0]
        json_file_root = name + '/' + name
        os.makedirs(os.path.dirname(json_file_root), exist_ok=True)
        read_IDs = False

        # open a file in which to store the tweets
        if max_days_old - min_days_old == 1:
            d = dt.datetime.now() - dt.timedelta(days=min_days_old)
            day = '{0}-{1:0>2}-{2:0>2}'.format(d.year, d.month, d.day)
        else:
            d1 = dt.datetime.now() - dt.timedelta(days=max_days_old - 1)
            d2 = dt.datetime.now() - dt.timedelta(days=min_days_old)
            day = '{0}-{1:0>2}-{2:0>2}_to_{3}-{4:0>2}-{5:0>2}'.format(
                  d1.year, d1.month, d1.day, d2.year, d2.month, d2.day)
        json_file = json_file_root + '_' + day + '.json'
        if os.path.isfile(json_file):
            print('Appending tweets to file named: ', json_file)
            read_IDs = True

        # authorize and load the twitter API
        api = load_api()

        # set the 'starting point' ID for tweet collection
        if read_IDs:
            # open the json file and get the latest tweet ID
            with open(json_file, 'r') as f:
                lines = f.readlines()
                max_id = json.loads(lines[-1])['id']
                print('Searching from the bottom ID in file')
        else:
            # get the ID of a tweet that is min_days_old
            if min_days_old == 0:
                max_id = -1
            else:
                max_id = get_tweet_id(api, days_ago=(min_days_old - 1))

        # set the smallest ID to search for
        since_id = get_tweet_id(api, days_ago=(max_days_old - 1))
        print('max id (starting point) =', max_id)
        print('since id (ending point) =', since_id)

        ''' tweet gathering loop '''
        start = dt.datetime.now()
        end = start + dt.timedelta(hours=time_limit)
        count, exitcount = 0, 0
        while dt.datetime.now() < end:
            count += 1
            print('count =', count)

            # collect tweets and update max_id
            tweets, max_id = tweet_search(api, search_phrase, max_tweets,
                                          max_id=max_id, since_id=since_id,
                                          geocode=USA)  # NOTE: USA is never defined in this script

            # write tweets to file in JSON format
            if tweets:
                write_tweets(tweets, json_file)
                exitcount = 0
            else:
                exitcount += 1
                if exitcount == 3:
                    if search_phrase == search_phrases[-1]:
                        sys.exit('Maximum number of empty tweet strings reached - exiting')
                    else:
                        print('Maximum number of empty tweet strings reached - breaking')
                        break


if __name__ == "__main__":
    main()
It throws the following error:
Traceback (most recent call last):
File "search.py", line 189, in <module>
main()
File "search.py", line 157, in main
since_id = get_tweet_id(api, days_ago=(max_days_old-1))
File "search.py", line 80, in get_tweet_id
tweet = api.search(q=query, count=10, until=tweet_date)
File "/usr/local/lib/python3.5/dist-packages/tweepy/binder.py", line 245, in _call
return method.execute()
File "/usr/local/lib/python3.5/dist-packages/tweepy/binder.py", line 229, in execute
raise TweepError(error_msg, resp, api_code=api_error_code)
tweepy.error.TweepError: [{'code': 215, 'message': 'Bad Authentication data.'}]
I entered the relevant tokens but still it doesn't work. Any help will be appreciated.
It's rare, but it does sometimes happen that the application keys need to be regenerated because of something (?) on the back end. I don't know if that's your issue, but it's worth trying.
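Before regenerating anything, one quick way to check whether the keys themselves authenticate is tweepy's verify_credentials call. A minimal sketch, assuming the same four keys as in load_api() above:

import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

try:
    me = api.verify_credentials()
    if me:
        print('Authenticated as', me.screen_name)
    else:
        print('Credentials rejected')
except tweepy.TweepError as e:
    print('Authentication check failed:', e)  # e.g. code 215: Bad Authentication data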
Also, you are not actually streaming tweets; that requires a different kind of request. You are using Twitter's REST API to search for tweets that have already occurred.
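If live streaming is actually what you want, here is a minimal sketch using tweepy's streaming interface (this is the tweepy 3.x API; the keys are the same placeholders as above):

import tweepy

class HashtagListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)  # each matching tweet as it arrives

    def on_error(self, status_code):
        if status_code == 420:  # the streaming API is rate limiting us
            return False        # returning False disconnects the stream

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
stream = tweepy.Stream(auth=auth, listener=HashtagListener())
stream.filter(track=['#Messi'])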
I gathered a bunch of tweets for analysis with Python, but upon trying to open the file I collected them into, I received the error message below. I don't know if maybe something is wrong with the schema of the tweets that I collected.
JSONDecodeError: Extra data: line 2 column 1 (char 12025)
Here is the code that I compiled:
with open('tweets1.json') as dakota_file:
    dakota_j = json.loads(dakota_file.read())
Please see code:
import sys
import jsonpickle
import os
import tweepy

# api is assumed to be an authenticated tweepy.API instance
searchQuery = '#Dakota-Access-Pipeline'  # this is what we're searching for
#maxTweets = 10000000  # Some arbitrary large number
maxTweets = 6000
tweetsPerQry = 100  # this is the max the API permits
#fName = 'tweets.txt'  # We'll store the tweets in a text file.
fName = 'tweets.json'

# If results from a specific ID onwards are reqd, set since_id to that ID,
# else default to no lower limit, go as far back as the API allows
sinceId = None

# If results only below a specific ID are reqd, set max_id to that ID,
# else default to no upper limit, start from the most recent tweet matching the search query
max_id = -10000000

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount < maxTweets:
        try:
            if max_id <= 0:
                if not sinceId:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if not sinceId:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) + '\n')
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            break

print("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))
Is it possible to get the full follower list of an account that has more than one million followers, like McDonald's?
I use Tweepy and follow this code:
c = tweepy.Cursor(api.followers_ids, id='McDonalds')
ids = []
for page in c.pages():
    ids.append(page)
I also tried this:
for id in c.items():
    ids.append(id)
But I always got the 'Rate limit exceeded' error, and there were only 5000 follower ids.
In order to avoid the rate limit, you can/should wait before the next follower-page request. Looks hacky, but it works:
import time
import tweepy

auth = tweepy.OAuthHandler(..., ...)
auth.set_access_token(..., ...)

api = tweepy.API(auth)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="McDonalds").pages():
    ids.extend(page)
    time.sleep(60)

print(len(ids))
Hope that helps.
Use the rate-limiting arguments when making the connection, and the API will self-regulate within the rate limit.
The sleep pause is not bad; I use it to simulate a human and to spread out activity over a time frame, with the API rate limiting as a final control.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)
Also add try/except blocks to capture and control errors.
Example code:
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/sample_rate_limit_w_cursor.py
I put my keys in an external file to make management easier.
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/keys.py
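The keys.py pattern in that link is roughly the following sketch (the variable names here are an assumption, not copied from the repo):

# keys.py - keep this file out of version control (e.g., list it in .gitignore)
consumer_key = 'xxx'
consumer_secret = 'xxx'
access_token = 'yyy'
access_secret = 'yyy'

# main script
from keys import consumer_key, consumer_secret, access_token, access_secret
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)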
I use this code and it works for a large number of followers.
There are two functions: one saves the follower ids after every sleep period, and another gets the full list.
It is a little messy, but I hope it is useful.
import csv
import os
import time

import numpy as np
import tweepy

# api is assumed to be an authenticated tweepy.API instance

def save_followers_status(filename, foloowersid):
    path = '//content//drive//My Drive//Colab Notebooks//twitter//' + filename
    # create the file on the first call
    if not os.path.isfile(path + '_followers_status.csv'):
        with open(path + '_followers_status.csv', 'w', newline='') as csvfile:
            filewriter = csv.writer(csvfile, delimiter=',')
    if len(foloowersid) > 0:
        print("saving followers status of", filename)
        file = path + '_followers_status.csv'
        # https://stackoverflow.com/questions/3348460/csv-file-written-with-python-has-blank-lines-between-each-row
        with open(file, mode='a', newline='') as csv_file:
            writer = csv.writer(csv_file, delimiter=',')
            for row in foloowersid:
                writer.writerow(np.array(row))
def get_followers_id(person):
    foloowersid = []
    count = 0
    influencer = api.get_user(screen_name=person)
    influencer_id = influencer.id
    number_of_followers = influencer.followers_count
    print("number of followers count : ", number_of_followers, '\n', 'user id : ', influencer_id)
    status = tweepy.Cursor(api.followers_ids, screen_name=person).items()
    for i in range(0, number_of_followers):
        try:
            user = next(status)
            foloowersid.append([user])
            count += 1
        except tweepy.TweepError:
            # hit the twitter rate limit: save what we have so far, then sleep for 15 min
            print('hit the rate limit of twitter, sleeping for 15 min')
            timestamp = time.strftime("%d.%m.%Y %H:%M:%S", time.localtime())
            print(timestamp)
            if len(foloowersid) > 0:
                print('the number fetched until this time :', count, 'of', number_of_followers, 'followers')
                save_followers_status(person, foloowersid)
                foloowersid = []
            time.sleep(15 * 60)
        except StopIteration:
            # the cursor is exhausted: stop iterating
            print('end of followers ', count, 'all followers count is : ', number_of_followers)
            break
    save_followers_status(person, foloowersid)
    return foloowersid
The answer from alecxe is good, however no one has referred to the docs. The correct information and explanation to answer the question live in the Twitter API documentation. From the documentation:
Results are given in groups of 5,000 user IDs and multiple “pages” of results can be navigated through using the next_cursor value in subsequent requests.
Tweepy's get_follower_ids() uses the https://api.twitter.com/1.1/followers/ids.json endpoint. This endpoint has a rate limit of 15 requests per 15 minutes.
You are getting the 'Rate limit exceeded' error because you are crossing that threshold.
Instead of manually putting a sleep in your code, you can pass wait_on_rate_limit=True when creating the Tweepy API object.
Moreover, the endpoint has an optional count parameter, which specifies the number of users to return per page. The Twitter API documentation does not say anything about its default value; its maximum value is 5000.
To get the most ids per request, explicitly set count to the maximum, so that you need fewer requests.
Here is my code for getting all the followers' ids:
auth = tweepy.OAuth1UserHandler(consumer_key='', consumer_secret='',
                                access_token='', access_token_secret='')
api = tweepy.API(auth, wait_on_rate_limit=True)

account_id = 71026122  # instead of account_id you can also use screen_name
follower_ids = []
for page in tweepy.Cursor(api.get_follower_ids, user_id=account_id, count=5000).pages():
    follower_ids.extend(page)