Is it possible to get the full follower list of an account that has more than a million followers, like McDonald's?
I use Tweepy with the following code:
c = tweepy.Cursor(api.followers_ids, id='McDonalds')
ids = []
for page in c.pages():
    ids.append(page)
I also tried this:
for id in c.items():
    ids.append(id)
But I always got the 'Rate limit exceeded' error, and there were only 5000 follower ids.
In order to avoid hitting the rate limit, you can/should wait before the next follower page request. Looks hacky, but it works:
import time
import tweepy

auth = tweepy.OAuthHandler(..., ...)
auth.set_access_token(..., ...)

api = tweepy.API(auth)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="McDonalds").pages():
    ids.extend(page)
    time.sleep(60)

print(len(ids))
Hope that helps.
Use the rate-limiting arguments when making the connection. The API will then throttle itself to stay within the rate limit.
The sleep pause is not a bad idea; I use it to simulate a human and to spread activity over a time frame, with the API rate limiting as a final control.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)
Also add try/except blocks to capture and handle errors, for example as sketched below.
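A minimal sketch of that pattern, assuming the api object above and Tweepy 3.x, where the exception class is tweepy.TweepError (newer versions raise tweepy.errors.TweepyException):
import tweepy

ids = []
try:
    # pages() yields lists of up to 5000 follower ids per request;
    # wait_on_rate_limit=True makes the API object sleep through rate limits itself
    for page in tweepy.Cursor(api.followers_ids, screen_name="McDonalds").pages():
        ids.extend(page)
except tweepy.TweepError as e:
    # keep whatever was collected before the error
    print("Twitter API error:", e)

print(len(ids), "follower ids collected")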
Example code:
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/sample_rate_limit_w_cursor.py
I put my keys in an external file to make management easier.
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/keys.py
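As a rough illustration of that pattern (a hypothetical keys.py, not necessarily identical to the one in the repo):
# keys.py (hypothetical) - keep this file out of version control
consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."
And in the main script:
import tweepy
import keys  # the module above

auth = tweepy.OAuthHandler(keys.consumer_key, keys.consumer_secret)
auth.set_access_token(keys.access_token, keys.access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)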
I use this code and it works for a large number of followers.
There are two functions: one saves the follower ids after every sleep period, and the other gets the list.
It is a little messy, but I hope it is useful.
import os
import csv
import time

import numpy as np
import tweepy

def save_followers_status(filename, followers_id):
    path = '//content//drive//My Drive//Colab Notebooks//twitter//' + filename
    if not os.path.isfile(path + '_followers_status.csv'):
        # create the file the first time around
        with open(path + '_followers_status.csv', 'w', newline='') as csvfile:
            filewriter = csv.writer(csvfile, delimiter=',')
    if len(followers_id) > 0:
        print("save followers status of ", filename)
        file = path + '_followers_status.csv'
        # https://stackoverflow.com/questions/3348460/csv-file-written-with-python-has-blank-lines-between-each-row
        with open(file, mode='a', newline='') as csv_file:
            writer = csv.writer(csv_file, delimiter=',')
            for row in followers_id:
                writer.writerow(np.array(row))

def get_followers_id(person):
    followers_id = []
    count = 0
    influencer = api.get_user(screen_name=person)
    influencer_id = influencer.id
    number_of_followers = influencer.followers_count
    print("number of followers count : ", number_of_followers, '\n', 'user id : ', influencer_id)
    status = tweepy.Cursor(api.followers_ids, screen_name=person).items()
    for i in range(0, number_of_followers):
        try:
            user = next(status)
            followers_id.append([user])
            count += 1
        except tweepy.TweepError:
            # rate limit reached: save what we have, sleep 15 minutes, then continue
            print('Twitter rate limit reached, sleeping for 15 min')
            timestamp = time.strftime("%d.%m.%Y %H:%M:%S", time.localtime())
            print(timestamp)
            if len(followers_id) > 0:
                print('fetched so far:', count, 'total followers count:', number_of_followers)
                save_followers_status(person, followers_id)
                followers_id = []
            time.sleep(15 * 60)
            next(status)
        except StopIteration:
            # no more followers to fetch
            print('end of followers', count, 'total followers count:', number_of_followers)
            save_followers_status(person, followers_id)
            followers_id = []
            break
    save_followers_status(person, followers_id)
    return followers_id
The answer from alecxe is good; however, no one has referred to the docs. The correct information and explanation live in the Twitter API documentation. From the documentation:
Results are given in groups of 5,000 user IDs and multiple “pages” of results can be navigated through using the next_cursor value in subsequent requests.
Tweepy's get_follower_ids() uses the https://api.twitter.com/1.1/followers/ids.json endpoint. This endpoint has a rate limit of 15 requests per 15 minutes.
You are getting the 'Rate limit exceeded' error because you are crossing that threshold.
Instead of manually putting a sleep in your code, you can pass wait_on_rate_limit=True when creating the Tweepy API object.
Moreover, the endpoint has an optional count parameter which specifies the number of users to return per page. The Twitter API documentation does not say anything about its default value. Its maximum value is 5000.
To get the most ids per request, explicitly set it to the maximum so that you need fewer requests.
Here is my code for getting all the followers' ids:
auth = tweepy.OAuth1UserHandler(consumer_key='', consumer_secret='',
                                access_token='', access_token_secret='')
api = tweepy.API(auth, wait_on_rate_limit=True)

account_id = 71026122  # instead of account_id you can also use screen_name

follower_ids = []
for page in tweepy.Cursor(api.get_follower_ids, user_id=account_id, count=5000).pages():
    follower_ids.extend(page)
Related
I'm trying to filter my search to accounts created less than 30 days ago, but I can't seem to find a way to make it work. I've tried a couple of things using pytz and datetime, but I think I'm doing it wrong.
I've seen created_at and since_id in the dev API, but I'm not sure how to properly use them in my searches, or whether they would actually give me the date accounts were created.
Here is my code:
def get_keywords(yaml_path):
    # Open the YAML file
    with open(yaml_path, 'r') as f:
        # Parse the YAML file
        data = yaml.safe_load(f)
    # Retrieve the list of keywords from the YAML file
    keywords = data['keywords']
    # Return the keywords
    return keywords

# Get the keywords from the YAML file
keywords = get_keywords(yaml_path)

# Set the starting page number
page = 1
# Set the maximum number of pages to retrieve
max_pages = 100000

# Set a flag to indicate when to stop paginating
done = False

for word in keywords:
    # Create a cursor for the search
    cursor = tweepy.Cursor(api.search_users, q=word, count=20)

    # Iterate through the pages of the search results
    for page in cursor.pages(max_pages):
        # Iterate through the search results and print the screen name and biography of each profile
        for user in page:
            # Check the number of followers
            if user.followers_count < 3000:
                screen_name = user.screen_name.encode('utf-8')
                biography = user.description.encode('utf-8')
                # Print the screen name and biography
                print(f'Screen name: {screen_name}')
                print(f'Biography: {biography}')
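One way to approach the 30-day filter, for what it's worth: Tweepy's v1.1 User model exposes created_at as a datetime, so you can compare it against a cutoff. A sketch (untested; note that older Tweepy versions return naive UTC datetimes while newer ones return timezone-aware ones):
from datetime import datetime, timedelta, timezone

def created_within_days(user, days=30):
    """Return True if the account was created within the last `days` days."""
    created = user.created_at
    if created.tzinfo is None:
        # older Tweepy versions return naive UTC datetimes; normalize before comparing
        created = created.replace(tzinfo=timezone.utc)
    return created >= datetime.now(timezone.utc) - timedelta(days=days)

# inside the inner loop above:
#     if user.followers_count < 3000 and created_within_days(user):
#         print(f'Screen name: {user.screen_name}')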
I'm currently trying to learn web scraping and decided to scrape some discord data. Code follows:
import requests
import json

def retrieve_messages(channelid):
    num = 0
    headers = {
        'authorization': 'here we enter the authorization code'
    }
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit=100', headers=headers
    )
    jsonn = json.loads(r.text)
    for value in jsonn:
        print(value['content'], '\n')
        num = num + 1
    print('number of messages we collected is', num)

retrieve_messages('server id goes here')
The problem: when I tried changing the limit in messages?limit=100, apparently it only accepts numbers between 0 and 100, meaning the maximum number of messages I can get is 100. I tried changing this number to 900, for example, to scrape more messages, but then I got the error TypeError: string indices must be integers.
Any ideas on how I could get, possibly, all the messages in a channel?
Thank you very much for reading!
APIs that return a bunch of records are almost always limited to some number of items.
Otherwise, if a large quantity of items is requested, the API may fail due to being out of memory.
For that purpose, most APIs implement pagination using limit, before and after parameters where:
limit: tells you how many messages to fetch
before: get messages before this message ID
after: get messages after this message ID
The Discord API is no exception, as the documentation tells us.
Here's how you do it:
First, you will need to query the data multiple times.
For that, you can use a while loop.
Make sure to add a condition that will prevent the loop from running indefinitely - I added a check for whether there are any messages left.
while True:
    # ... requests code
    jsonn = json.loads(r.text)
    if len(jsonn) == 0:
        break

    for value in jsonn:
        print(value['content'], '\n')
        num = num + 1
Define a variable that holds the id of the last message you fetched, and update it with the id of each message you print:
def retrieve_messages(channelid):
    last_message_id = None
    while True:
        # ...
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num = num + 1
Now on the first run last_message_id is None, and on subsequent requests it holds the id of the last message you printed.
Use that to build your query:
while True:
    query_parameters = f'limit={limit}'
    if last_message_id is not None:
        query_parameters += f'&before={last_message_id}'

    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}', headers=headers
    )
    # ...
Note: Discord returns the latest messages first, so you have to use the before parameter.
Here's a fully working example of your code:
import requests
import json

def retrieve_messages(channelid):
    num = 0
    limit = 10

    headers = {
        'authorization': 'auth header here'
    }

    last_message_id = None

    while True:
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            query_parameters += f'&before={last_message_id}'

        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}', headers=headers
        )
        jsonn = json.loads(r.text)
        if len(jsonn) == 0:
            break

        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num = num + 1

    print('number of messages we collected is', num)

retrieve_messages('server id here')
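If you want to keep the scraped data rather than just print it, a small variation on the loop above collects the raw message objects and dumps them to a file (a sketch, assuming the same headers and channel id):
import json
import requests

def retrieve_messages_to_file(channelid, headers, outfile='messages.json', limit=100):
    all_messages = []
    last_message_id = None
    while True:
        params = f'limit={limit}'
        if last_message_id is not None:
            params += f'&before={last_message_id}'
        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{params}',
            headers=headers,
        )
        batch = r.json()
        if not batch:
            break
        all_messages.extend(batch)
        last_message_id = batch[-1]['id']
    # dump the raw message objects for later analysis
    with open(outfile, 'w', encoding='utf-8') as f:
        json.dump(all_messages, f, ensure_ascii=False, indent=2)
    return all_messages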
To answer this question, we must look at the Discord API. Googling "discord api get messages" gets us the developer reference for the Discord API. The particular endpoint you are using is documented here:
https://discord.com/developers/docs/resources/channel#get-channel-messages
The limit is documented here, along with the around, before, and after parameters. Using one of these parameters (most likely after) we can paginate the results.
In outline, it looks something like this. Note that after takes a message id (a snowflake), not a numeric offset, so we keep track of the newest id seen so far:
last_id = 0  # a snowflake of 0 starts from the beginning of the channel
limit = 100
all_messages = []

while True:
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit={limit}&after={last_id}',
        headers=headers
    )
    batch = r.json()
    all_messages.extend(batch)

    if len(batch) < limit:
        break  # we have reached the end of all the messages, exit the loop
    else:
        # continue from the newest message id we have seen so far
        last_id = max(int(m['id']) for m in batch)
By the way, you will probably want to print(r.text) right after the response comes in so you can see what the response looks like. It will save a lot of confusion.
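For example, purely as a debugging aid inside the request loop:
r = requests.get(
    f'https://discord.com/api/v9/channels/{channelid}/messages?limit={limit}&after={last_id}',
    headers=headers
)
print(r.status_code)   # 200 means success; 401/403 usually point to an auth problem
print(r.text[:500])    # peek at the raw body before trying to parse it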
Here is my solution. Feedback is welcome as I'm newish to Python. Kindly provide me w/ credit/good-luck if using this. Thank you =)
import requests

CHANNELID = 'REPLACE_ME'
HEADERS = {'authorization': 'REPLACE_ME'}
LIMIT = 100

all_messages = []

r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}', headers=HEADERS)
all_messages.extend(r.json())
print(f'len(r.json()) is {len(r.json())}', '\n')

while len(r.json()) == LIMIT:
    last_message_id = r.json()[-1].get('id')
    r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}&before={last_message_id}', headers=HEADERS)
    all_messages.extend(r.json())
    print(f'len(r.json()) is {len(r.json())} and last_message_id is {last_message_id} and len(all_messages) is {len(all_messages)}')
I want to schedule my Python script to run every hour and save the data in an Elasticsearch index. To do that I used a function I wrote, set_interval, in a script that uses the tweepy library. But it doesn't work as I need it to: it runs every minute and saves the data to the index. Even after setting the seconds to 3600, it still runs every minute. I want to configure it to run on an hourly basis.
How can I fix this? Here's my Python script:
def call_at_interval(time, callback, args):
    while True:
        timer = Timer(time, callback, args=args)
        timer.start()
        timer.join()

def set_interval(time, callback, *args):
    Thread(target=call_at_interval, args=(time, callback, args)).start()

def get_all_tweets(screen_name):
    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    screen_name = ""

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        # print("getting tweets before %s" % (oldest))

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        # print("...%s tweets downloaded so far" % (len(alltweets)))

    outtweets = [{'ID': tweet.id_str, 'Text': tweet.text, 'Date': tweet.created_at, 'author': tweet.user.screen_name} for tweet in alltweets]

    def save_es(outtweets, es):  # PEP8 convention
        data = [  # Please without s in data
            {
                "_index": "index name",
                "_type": "type name",
                "_id": index,
                "_source": ID
            }
            for index, ID in enumerate(outtweets)
        ]
        helpers.bulk(es, data)

    save_es(outtweets, es)

    print('Run at:')
    print(datetime.now())
    print("\n")

set_interval(3600, get_all_tweets(screen_name))
Why do you need so much complexity to do a task every hour? You can run the script every hour the way shown below; note that each iteration takes 1 hour plus the time needed to do the work:
import time

def do_some_work():
    print("Do some work")
    time.sleep(1)
    print("Some work is done!")

if __name__ == "__main__":
    time.sleep(60)  # imagine you would like to start work in 1 minute first time
    while True:
        do_some_work()
        time.sleep(3600)  # do work every one hour
If you want to run the script exactly every hour, use the code below:
import time
import threading

def do_some_work():
    print("Do some work")
    time.sleep(4)
    print("Some work is done!")

if __name__ == "__main__":
    time.sleep(60)  # imagine you would like to start work in 1 minute first time
    while True:
        thr = threading.Thread(target=do_some_work)
        thr.start()
        time.sleep(3600)  # do work every one hour
In this case thr is supposed to finish its work in less than 3600 seconds; if it does not, you'll still get results, but they will come from another attempt. See the example below:
import time
import threading

class AttemptCount:
    def __init__(self, attempt_number):
        self.attempt_number = attempt_number

def do_some_work(_attempt_number):
    print(f"Do some work {_attempt_number.attempt_number}")
    time.sleep(4)
    print(f"Some work is done! {_attempt_number.attempt_number}")
    _attempt_number.attempt_number += 1

if __name__ == "__main__":
    attempt_number = AttemptCount(1)
    time.sleep(1)  # imagine you would like to start work in 1 minute first time
    while True:
        thr = threading.Thread(target=do_some_work, args=(attempt_number,),)
        thr.start()
        time.sleep(1)  # do work every one hour
The result you'll get in this case is:
Do some work 1
Do some work 1
Do some work 1
Do some work 1
Some work is done! 1
Do some work 2
Some work is done! 2
Do some work 3
Some work is done! 3
Do some work 4
Some work is done! 4
Do some work 5
Some work is done! 5
Do some work 6
Some work is done! 6
Do some work 7
Some work is done! 7
Do some work 8
Some work is done! 8
Do some work 9
I like using subprocess.Popen for such tasks: if the child subprocess does not finish its work within one hour for any reason, you just terminate it and start a new one, for example as sketched below.
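A rough sketch of that approach, assuming the hourly work lives in a hypothetical worker.py script:
import subprocess
import time

INTERVAL = 3600  # one hour

while True:
    started = time.monotonic()
    proc = subprocess.Popen(["python", "worker.py"])  # hypothetical worker script
    try:
        proc.wait(timeout=INTERVAL)      # give it at most one hour to finish
    except subprocess.TimeoutExpired:
        proc.terminate()                 # still running after an hour: stop it and move on
        proc.wait()
    # sleep whatever is left of the hour before starting the next run
    time.sleep(max(0, INTERVAL - (time.monotonic() - started)))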
You can also use cron to schedule a process to run every hour.
Get rid of all the timer code and just write the logic; cron will do the scheduling for you.
Add this line to the end of the file opened by crontab -e:
0 * * * * /path/to/python /path/to/script.py
0 * * * * means run at minute zero of every hour; you can find more explanation here.
Also, I noticed you are passing get_all_tweets(screen_name) to set_interval, which calls the function immediately instead of handing it over as a callback; I think you have to call it from outside, as sketched below.
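For reference, if you did keep the timer approach, the callback should be passed uncalled (based on the set_interval signature in the question):
# wrong: runs get_all_tweets once, immediately, and passes its return value (None) as the callback
set_interval(3600, get_all_tweets(screen_name))

# right: pass the function itself plus its argument, so the timer can call it every hour
set_interval(3600, get_all_tweets, screen_name)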
Just keep your script to this:
def get_all_tweets(screen_name):
    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    screen_name = ""

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        # print("getting tweets before %s" % (oldest))

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        # print("...%s tweets downloaded so far" % (len(alltweets)))

    outtweets = [{'ID': tweet.id_str, 'Text': tweet.text, 'Date': tweet.created_at, 'author': tweet.user.screen_name} for tweet in alltweets]

    def save_es(outtweets, es):  # PEP8 convention
        data = [  # Please without s in data
            {
                "_index": "index name",
                "_type": "type name",
                "_id": index,
                "_source": ID
            }
            for index, ID in enumerate(outtweets)
        ]
        helpers.bulk(es, data)

    save_es(outtweets, es)

get_all_tweets("")  # your screen name here
You can only retrieve 100 user objects per request with the api.lookup_users() method. Is there an easy way to retrieve more than 100 using Tweepy and Python? I have read this post: User ID to Username tweepy, but it does not help with the more-than-100 problem. I am pretty much a novice in Python, so I cannot come up with a solution myself. What I have tried is this:
users = []
i = 0
num_pages = 2
while i < num_pages:
    try:
        # Look up a collection of ids
        users.append(api.lookup_users(user_ids=ids[100*i:100*(i+1)-1]))
    except tweepy.TweepError:
        # We get a tweep error
        print('Something went wrong, quitting...')
    i = i + 1
where ids is a list containing the ids, but I get IndexError: list index out of range when I try to get a user with an index higher than 100. If it helps, I am only interested in getting the screen names from the user ids.
You're right that you need to send the ids to the API in batches of 100, but you're ignoring the fact that you might not have an exact multiple of 100 ids. Try the following:
import tweepy

def lookup_user_list(user_id_list, api):
    full_users = []
    users_count = len(user_id_list)
    try:
        for i in range(users_count // 100 + 1):
            full_users.extend(api.lookup_users(user_ids=user_id_list[i*100:min((i+1)*100, users_count)]))
        return full_users
    except tweepy.TweepError:
        print('Something went wrong, quitting...')

results = lookup_user_list(ids, api)
By taking the minimum of (i+1)*100 and users_count, we ensure the final batch only gets the users left over. results will be a list of the looked-up users.
You may also hit rate limits - when setting up your API, you should take care to let tweepy catch these gracefully and remove some of the hard work, like so:
consumer_key = 'X'
consumer_secret = 'X'
access_token = 'X'
access_token_secret = 'X'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
I haven't tested it since I don't have access to the API.
But if you have a collection of user ids in any range, this should fetch all of them.
It fetches any remainder first, meaning if you have a list of 250 ids, it will fetch 50 users with the last 50 ids in the list.
Then it will fetch the remaining 200 users in batches of 100.
from tweepy import api, TweepError

users = []
user_ids = []  # collection of user ids

count_100 = len(user_ids) // 100   # number of full batches of 100 ids
remainder = len(user_ids) % 100    # ids left over after the full batches

for i in range(0, count_100 + 1):
    try:
        if i == 0:
            if remainder > 0:
                # fetch the leftover ids (the last `remainder` ids in the list) first
                users.append(api.lookup_users(user_ids=user_ids[-remainder:]))
        else:
            end_at = i * 100
            start_at = end_at - 100
            users.append(api.lookup_users(user_ids=user_ids[start_at:end_at]))
    except TweepError:
        print('Something went wrong, quitting...')
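Since the question only needs the screen names, here is a small follow-up sketch; note that users is a list of batches, so it has to be flattened:
# lookup_users returns User objects; pull the screen names out of every batch
screen_names = [user.screen_name for batch in users for user in batch]
print(screen_names[:10])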
I am wondering how I can automate my program to fetch tweets at the max rate of 180 requests per 15 minutes, which is equivalent to the max count of 100 per request totaling 18,000 tweets. I am creating this program for an independent case study at school.
I would like my program to avoid being rate limited and ending up terminated. So what I would like it to do is constantly use the maximum number of requests per 15 minutes, and be able to leave it running for 24 hours without user interaction to retrieve all tweets possible for analysis.
Here is my code. It gets tweets for a query and puts them into a text file, but it eventually gets rate limited. I would really appreciate the help.
import logging
import time
import csv
import twython
import json

app_key = ""
app_secret = ""
oauth_token = ""
oauth_token_secret = ""

twitter = twython.Twython(app_key, app_secret, oauth_token, oauth_token_secret)

tweets = []
MAX_ATTEMPTS = 1000000
# Max Number of tweets per 15 minutes
COUNT_OF_TWEETS_TO_BE_FETCHED = 18000

for i in range(0, MAX_ATTEMPTS):
    if COUNT_OF_TWEETS_TO_BE_FETCHED < len(tweets):
        break

    if 0 == i:
        results = twitter.search(q="$AAPL", count='100', lang='en',)
    else:
        results = twitter.search(q="$AAPL", include_entities='true', max_id=next_max_id)

    for result in results['statuses']:
        print(result)
        with open('tweets.txt', 'a') as outfile:
            json.dump(result, outfile, sort_keys=True, indent=4)

    try:
        next_results_url_params = results['search_metadata']['next_results']
        next_max_id = next_results_url_params.split('max_id=')[1].split('&')[0]
    except:
        break
You should be using Twitter's Streaming API.
This will allow you to receive a near-realtime feed of your search. You can write those tweets to a file just as fast as they come in.
Using the track parameter you will be able to receive only the specific tweets you're interested in.
You'll need to use Twython Streamer - and your code will look something like this:
from twython import TwythonStreamer

class MyStreamer(TwythonStreamer):
    def on_success(self, data):
        if 'text' in data:
            print(data['text'].encode('utf-8'))

    def on_error(self, status_code, data):
        print(status_code)

stream = MyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
stream.statuses.filter(track='$AAPL')
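To actually write the tweets to a file as they arrive, as mentioned above, on_success could be extended roughly like this (a sketch that appends one JSON object per line):
import json
from twython import TwythonStreamer

class SavingStreamer(TwythonStreamer):
    def on_success(self, data):
        if 'text' in data:
            # append each tweet as one JSON line for easy later analysis
            with open('tweets.txt', 'a') as outfile:
                outfile.write(json.dumps(data) + '\n')

    def on_error(self, status_code, data):
        print(status_code)
        self.disconnect()  # stop streaming on errors such as 420 (rate limited)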