I'm trying to use Tweepy to get the full list of followers from an account with around 500k followers. I have code that returns the usernames for smaller accounts (under 100 followers), but as soon as an account has even slightly more, say 110 followers, it stops working. Any help figuring out how to make it work with larger follower counts is greatly appreciated!
Here's the code I have right now:
import tweepy
import time
key1 = "..."
key2 = "..."
key3 = "..."
key4 = "..."
accountvar = raw_input("Account name: ")
auth = tweepy.OAuthHandler(key1, key2)
auth.set_access_token(key3, key4)
api = tweepy.API(auth)
ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name=accountvar).pages():
    ids.extend(page)
    time.sleep(60)

users = api.lookup_users(user_ids=ids)
for u in users:
    print u.screen_name
The error I keep getting is:
Traceback (most recent call last):
File "test.py", line 24, in <module>
users = api.lookup_users(user_ids=ids)
File "/Library/Python/2.7/site-packages/tweepy/api.py", line 321, in lookup_users
return self._lookup_users(post_data=post_data)
File "/Library/Python/2.7/site-packages/tweepy/binder.py", line 239, in _call
return method.execute()
File "/Library/Python/2.7/site-packages/tweepy/binder.py", line 223, in execute
raise TweepError(error_msg, resp)
tweepy.error.TweepError: [{u'message': u'Too many terms specified in query.', u'code': 18}]
I've looked at a bunch of other questions about this kind of problem, but none of the solutions I could find worked for me. If someone has a link to a working solution, please send it my way!
I actually figured it out, so I'll post the solution here just for reference.
import tweepy
import time
key1 = "..."
key2 = "..."
key3 = "..."
key4 = "..."
accountvar = raw_input("Account name: ")
auth = tweepy.OAuthHandler(key1, key2)
auth.set_access_token(key3, key4)
api = tweepy.API(auth)
users = tweepy.Cursor(api.followers, screen_name=accountvar).items()
while True:
    try:
        user = next(users)
    except tweepy.TweepError:
        time.sleep(60 * 15)
        user = next(users)
    except StopIteration:
        break
    print "#" + user.screen_name
This pauses for 15 minutes after every 300 names and then continues, which keeps it from running into rate-limit problems. It will obviously take ages for large accounts, but as Leb mentioned:
The twitter API only allows 100 users to be searched for at a time...[so] what you'll need to do is iterate through each 100 users but staying within the rate limit.
You basically just have to leave the program running if you want the next batch. I don't know why mine is now returning 300 at a time instead of 100; as I mentioned earlier, my original program was only getting 100 as well.
Hope this helps anyone else that had the same problem as me, and shoutout to Leb for reminding me to focus on the rate limit.
To extend upon this:
You can harvest 3,000 users per 15 minutes by adding a count parameter:
users = tweepy.Cursor(api.followers, screen_name=accountvar, count=200).items()
This will call the Twitter API 15 times as per your version, but rather than the default count=20, each API call will return 200 (i.e. you get 3000 rather than 300).
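If you want to avoid the manual 15-minute sleeps entirely, here is a minimal sketch along those lines. It assumes a Tweepy 3.x-style API object (where api.followers and wait_on_rate_limit exist), an already-configured auth handler, and the accountvar variable from the question:

import tweepy

# Assumes auth is an already-configured tweepy.OAuthHandler and accountvar
# holds the target screen name, as in the question.
# wait_on_rate_limit makes Tweepy sleep automatically whenever the
# 15-minute window is exhausted, so no manual time.sleep() is needed.
api = tweepy.API(auth, wait_on_rate_limit=True)

screen_names = []
# count=200 asks for 200 followers per followers/list call, i.e. up to
# 3,000 followers per 15-minute window.
for user in tweepy.Cursor(api.followers, screen_name=accountvar, count=200).items():
    screen_names.append(user.screen_name)

print(len(screen_names))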
Twitter provides two ways to fetch followers:

Fetching the full followers list (using followers/list in the Twitter API, or api.followers in Tweepy). Alec and mataxu have provided this approach in their answers. The rate limit here means you can get at most 200 * 15 = 3,000 followers in every 15-minute window.

The second approach involves two stages:
a) Fetch only the follower ids first (using followers/ids in the Twitter API, or api.followers_ids in Tweepy). You can get 5000 * 15 = 75K follower ids in each 15-minute window.
b) Look up their usernames or other data (using users/lookup in the Twitter API, or api.lookup_users in Tweepy). This is rate-limited to about 100 * 180 = 18K lookups in each 15-minute window.

Considering the rate limits, the second approach gives you follower data about 6 times faster than the first.
Below is code that could be used to do it with the second approach:

# First, make sure you have set wait_on_rate_limit to True while connecting through Tweepy
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# The code below requests 5000 follower ids per call, and therefore gives 75K ids
# in every 15-minute window (as 15 requests can be made in each window).
followerids = []
for user in tweepy.Cursor(api.followers_ids, screen_name=accountvar, count=5000).items():
    followerids.append(user)
print(len(followerids))

# The function below makes lookup requests for ids 100 at a time,
# leading to 18K lookups in each 15-minute window.
def get_usernames(userids, api):
    fullusers = []
    u_count = len(userids)
    print(u_count)
    try:
        for i in range(int(u_count/100) + 1):
            end_loc = min((i + 1) * 100, u_count)
            fullusers.extend(
                api.lookup_users(user_ids=userids[i * 100:end_loc])
            )
        return fullusers
    except:
        import traceback
        traceback.print_exc()
        print('Something went wrong, quitting...')

# Call the function with the list of follower ids and the Tweepy api connection
fullusers = get_usernames(followerids, api)
Hope this helps.
A similar approach can be followed for fetching friends' details by using api.friends_ids in place of api.followers_ids.
If you need more resources on the rate limit comparison and the second approach, check the links below:
https://github.com/tweepy/tweepy/issues/627
https://labsblog.f-secure.com/2018/02/27/how-to-get-twitter-follower-data-using-python-and-tweepy/
The Twitter API only allows 100 users to be searched for at a time. That's why no matter how many you pass in, you'll only get 100 back. followers_ids is giving you the correct number of users, but you're being limited by GET users/lookup.
What you'll need to do is iterate through each 100 users while staying within the rate limit.
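To make that concrete, here is a rough sketch (not part of the original answer) that slices the collected ids into batches of 100 and backs off when the rate limit is hit. It assumes api is an authenticated tweepy.API instance and followerids is the id list gathered via api.followers_ids, as in the answer above:

import time
import tweepy

users = []
for start in range(0, len(followerids), 100):
    batch = followerids[start:start + 100]  # users/lookup accepts at most 100 ids per call
    try:
        users.extend(api.lookup_users(user_ids=batch))
    except tweepy.TweepError:
        # Rate limit hit: wait out the 15-minute window, then retry this batch once.
        time.sleep(15 * 60)
        users.extend(api.lookup_users(user_ids=batch))

print(len(users))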
I want to list all customers in Chargebee, and currently I have 101 customers. I use chargebee.Customer.list({'limit': 100}) because the documentation says the limit parameter is "optional, integer, default=10, min=1, max=100".
Documentation link: https://apidocs.chargebee.com/docs/api/customers?prod_cat_ver=2#list_customers
I have more than 100 customers but I still do not have "next_offset" returned.
May I ask how I can get all customers? Or 100 customers at first with an offset?
Thanks.
May I ask how I can get all customers?
According to the Chargebee documentation for list operations (here), one has to repeatedly call the API until the next_offset field is missing in the response object. The offset parameter allows us to start where we left off.
One can use the limit parameter to ask Chargebee for more customers in one call, reducing the number of total calls we have to make.
I included a Python snippet as an example of how to implement pagination.
import chargebee


def get_all_customers(params, limit=100, offset=None):
    customers = list()
    params['limit'] = limit
    params['offset'] = offset

    while True:
        entries = chargebee.Customer.list(params)
        customers.extend(entries.response)

        if entries.next_offset:
            params['offset'] = entries.next_offset
        else:
            break

    return customers


def main():
    chargebee.configure('<api_token>', '<site>')
    customers = get_all_customers({"first_name[is]": "John"})
    for customer in customers:
        print(customer)


if __name__ == '__main__':
    main()
I have more than 100 customers, but I still do not have "next_offset" returned.
Unfortunately, we cannot provide help with that, as we cannot access your Chargebee environment and debug.
I'm currently working with an API and I'm having some issues getting it to return all values. The API allows page sizes of up to 500 records at a time and defaults to 25 records per "page"; you can also move between pages. You can change the page size by adding ?page[size]={number between 1-500} to the endpoint.
My issue is that I'm storing the values returned from the API in a dictionary. When I request the maximum amount of data and the full population has more than 500 records, I get no errors, but when there are fewer than 500 I get a KeyError, since the code expects more data than is available. I don't want to have to guess the exact page size for each and every request. Is there an easier way to get all available data without having to request the exact amount for the page size? Ideally I'd just get the maximum amount from any request, always going up to the upper bound of what the request can return.
Thanks!
Here's an example of some code from the script:
import operator

import numpy as np
import requests

session = requests.Session()  # assumed here; the real script creates its session elsewhere

base_api_endpoint = "{sensitive_url}?page[size]=300"
response = session.get(base_api_endpoint)
print(response)
print(" ")

d = response.json()
data = [item['attributes']['name'] for item in d['data']]

result = {}
sorted_result = {}
for i in data:
    result[i] = data.count(i)

sorted_value_index = np.argsort(list(result.values()))
dictionary_keys = list(result.keys())
sorted_dict = {dictionary_keys[i]: sorted(result.values())[i]
               for i in range(len(dictionary_keys))}

sorted_d = dict(sorted(result.items(), key=operator.itemgetter(1), reverse=True))
for key, value in sorted_d.items():
    print(key)
For some context, this dictionary structure is used in other areas of the program to print both the key and the value, but for simplicity's sake I'm just printing the key here.
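One common way around guessing the page size (a sketch, not something from the question itself) is to keep the page size at the maximum and loop over pages until the API stops returning records. The page[number] parameter and the empty-data stop condition below are assumptions about this particular API; adjust them to whatever the API actually exposes (some APIs return a links.next URL instead):

import requests

session = requests.Session()
base_api_endpoint = "{sensitive_url}"  # elided in the question

all_names = []
page_number = 1
while True:
    # page[size]=500 is the documented maximum; page[number] is assumed here.
    response = session.get(
        base_api_endpoint,
        params={"page[size]": 500, "page[number]": page_number},
    )
    response.raise_for_status()
    payload = response.json()

    # .get() avoids the KeyError when a page comes back without 'data'.
    records = payload.get("data", [])
    if not records:
        break  # no more pages

    all_names.extend(item["attributes"]["name"] for item in records)
    page_number += 1

print(len(all_names))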
I am trying to get a list of all JIRA issues so that I may iterate through them in the following manner:
from jira import JIRA
jira = JIRA(basic_auth=('username', 'password'), options={'server':'https://MY_JIRA.atlassian.net'})
issue = jira.issue('ISSUE_KEY')
print(issue.fields.project.key)
print(issue.fields.issuetype.name)
print(issue.fields.reporter.displayName)
print(issue.fields.summary)
print(issue.fields.comment.comments)
The code above returns the desired fields (but only one issue at a time); however, I need to be able to pass a list of all issue keys into:
issue = jira.issue('ISSUE_KEY')
The idea is to write a for loop that would go through this list and print the indicated fields.
I have not been able to populate this list.
Can someone point me in the right direction please?
def get_all_issues(jira_client, project_name, fields):
    issues = []
    i = 0
    chunk_size = 100
    while True:
        chunk = jira_client.search_issues(f'project = {project_name}', startAt=i, maxResults=chunk_size, fields=fields)
        i += chunk_size
        issues += chunk.iterable
        if i >= chunk.total:
            break
    return issues

issues = get_all_issues(jira, 'JIR', ["id", "fixVersion"])
options = {'server': 'YOUR SERVER NAME'}
jira = JIRA(options, basic_auth=('YOUR EMAIL', 'YOUR PASSWORD'))

size = 100
initial = 0
while True:
    start = initial * size
    issues = jira.search_issues('project=<NAME OR ID>', start, size)
    if len(issues) == 0:
        break
    initial += 1
    for issue in issues:
        print 'ticket-no=', issue
        print 'IssueType=', issue.fields.issuetype.name
        print 'Status=', issue.fields.status.name
        print 'Summary=', issue.fields.summary
The first 3 arguments of jira.search_issues() are the JQL query, the starting index (0-based, hence the multiplication when computing start) and the maximum number of results.
You can execute a search instead of a single issue get.
Let's say your project key is PRO-KEY, to perform a search, you have to use this query:
https://MY_JIRA.atlassian.net/rest/api/2/search?jql=project=PRO-KEY
This will return the first 50 issues of PRO-KEY, along with the total number of matching issues in the total field (maxResults only reports the page size).
Given that number, you can perform further searches by adding this to the previous query:
&startAt=50
With this new parameter you will be able to fetch the issues from 51 to 100 (or 50 to 99 if you consider the first issue 0).
The next query will use &startAt=100, and so on until you have fetched all the issues in PRO-KEY.
If you wish to fetch more than 50 issues, add to the query:
&maxResults=200
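To make the paging loop concrete, here's a minimal sketch of that REST-based approach using the requests library. The credentials and the page size of 200 are illustrative assumptions, not something prescribed by the answer above:

import requests

base_url = "https://MY_JIRA.atlassian.net/rest/api/2/search"
auth = ("username", "password")  # assumed credentials

issues = []
start_at = 0
max_results = 200
while True:
    resp = requests.get(
        base_url,
        params={"jql": "project=PRO-KEY", "startAt": start_at, "maxResults": max_results},
        auth=auth,
    )
    resp.raise_for_status()
    payload = resp.json()

    page = payload["issues"]
    if not page:
        break  # defensive stop if the server returns an empty page

    issues.extend(page)
    start_at += len(page)

    # 'total' is the number of issues matching the JQL query.
    if start_at >= payload["total"]:
        break

print(len(issues))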
You can use the jira.search_issues() method to pass in a JQL query. It will return the list of issues matching the JQL:
issues_in_proj = jira.search_issues('project=PROJ')
This will give you a list of issues that you can iterate through.
Starting with Python 3.8, reading all issues can be done relatively concisely using the walrus operator:
issues = []
while issues_chunk := jira.search_issues('project=PROJ', startAt=len(issues)):
    issues += list(issues_chunk)
(Since we need len(issues) at every step, we cannot use a list comprehension, can we?)
Together with initialization, caching and "preprocessing" (e.g. just taking issue.raw), you could write something like this:
import json
import os

import jira

jira = jira.JIRA(
    server="https://jira.at-home.com",
    basic_auth=json.load(open(os.path.expanduser("~/.jira-credentials"))),
    validate=True,
)

issues = json.load(open("jira_issues.json"))

while issues_chunk := jira.search_issues('project=PROJ', startAt=len(issues)):
    issues += [issue.raw for issue in issues_chunk]

json.dump(issues, open("jira_issues.json", "w"))
I am trying to build a script where I can get the check-ins for a specific location. For some reason, when I specify lat/long coords, VK never returns any check-ins, so I have to fetch location IDs first and then request the check-ins from that list. However, I am not sure how to use the offset feature, which I presume is supposed to work somewhat like pagination.
So far I have this:
import vk
import json
app_id = #enter app id
login_nr = #enter your login phone or email
password = '' #enter password
vkapi = vk.API(app_id, login_nr, password)
vkapi.getServerTime()
def get_places(lat, lon, rad):
    name_list = []
    try:
        locations = vkapi.places.search(latitude=lat, longitude=lon, radius=rad)
        name_list.append(locations['items'])
    except Exception, e:
        print '*********------------ ERROR ------------*********'
        print str(e)
    return name_list

# Returns last checkins up to a maximum of 100
# Define the number of checkins you want, 100 being maximum
def get_checkins_id(place_id, check_count):
    checkin_list = []
    try:
        checkins = vkapi.places.getCheckins(place=place_id, count=check_count)
        checkin_list.append(checkins['items'])
    except Exception, e:
        print '*********------------ ERROR ------------*********'
        print str(e)
    return checkin_list
What I would like to do eventually is combine the two into a single function, but before that I have to figure out how offset works; the current VK API documentation does not explain it too well. I would like the code to read something like this:
def get_users_list_geo(lat, lon, rad, count):
    users_list = []
    locations_list = []
    users = []
    locations = vkapi.places.search(latitude=lat, longitude=lon, radius=rad)
    for i in locations[0]:
        locations_list.append(i['id'])
    for i in locations:
        # Get each location ID
        # Get checkins for the location
        # Append checkin and ID to the list
From what I understand, I have to keep track of the offset when getting the check-ins and then somehow account for locations that have more than 100 check-ins. Anyway, I would greatly appreciate any help or advice. If you have any suggestions on the script, I would love to hear them as well; I am teaching myself Python, so clearly I am not very good yet.
Thanks!
I've worked with the VK API from JavaScript, but I think the logic is the same.
TL;DR: Offset is the number of results (starting from the first) that the API should skip in the response.
For example, say you make a query which should return 1000 results (let's imagine that you know the exact number of results).
But VK returns only 100 per request. So how do you get the other 900?
You tell the API: give me the next 100 results. "Next" is what offset expresses: the number of results you want to skip because you've already handled them. So the VK API takes the 1000 results, skips the first 100, and returns to you the next (second) 100.
Also, if you are talking about this method (http://vk.com/dev/places.getCheckins) in your first paragraph, please check that your lat/long are floats, not integers. It could also be useful to try swapping lat/long: maybe you got them mixed up?
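To tie the offset idea back to the question's code, here's a minimal sketch of paging through a location's check-ins 100 at a time. It reuses the question's vkapi object and places.getCheckins call, and it assumes the method accepts an offset parameter (as the linked docs describe) and returns a dict with an 'items' list:

def get_all_checkins(place_id, page_size=100):
    # Pages through a location's check-ins using the offset parameter.
    # Stops when a page comes back smaller than page_size.
    all_checkins = []
    offset = 0
    while True:
        response = vkapi.places.getCheckins(place=place_id, count=page_size, offset=offset)
        items = response.get('items', [])
        all_checkins.extend(items)
        if len(items) < page_size:
            break  # last page reached
        offset += page_size
    return all_checkins

checkins = get_all_checkins(some_place_id)  # some_place_id comes from places.search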
Using the following code I get a maximum of 100 records. The call is supposed to return a token for the next 100 records; how can I use this token to get the next 100 records?
ref: http://boto.readthedocs.org/en/latest/ref/rds.html (look for get_all_dbsnapshots and max_records)
all_dbsnapshots = rdsConn.get_all_dbsnapshots()
If you want to increase the amount of records returned, you can use the "max_records" parameter when requesting the snapshots. The default is 100.
all_dbsnapshots = rdsConn.get_all_dbsnapshots(max_records=10000)
If more than that many records exist, you can use the marker returned with the previous result set to fetch the next batch:
additional_snapshots = rdsConn.get_all_dbsnapshots(marker=all_dbsnapshots.marker)
For additional help, see the boto documentation: http://boto.readthedocs.org/en/latest/ref/rds.html
The max limit is 100; after that you use the marker token as follows (it's working fine now):
marker = None

def SnapshotTest(rdsConn, marker):
    all_dbsnapshots = rdsConn.get_all_dbsnapshots(marker=marker)
    marker = all_dbsnapshots.marker
    for snapshot_name in all_dbsnapshots:
        print snapshot_name
    if marker:
        try:
            SnapshotTest(rdsConn, marker)  # recursive call for the next batch
        except:
            pass

SnapshotTest(rdsConn, marker)
You can try something like this to iterate through all your snapshots:
import boto.rds

rds_conn = boto.rds.connect_to_region('us-east-1')

snapshots_marker = ""
while snapshots_marker is not None:
    snapshots = rds_conn.get_all_dbsnapshots(marker=snapshots_marker)
    snapshots_marker = snapshots.marker
    for snap in snapshots:
        pass  ## Do something with your snapshot here