Problem:
Often we'd like to pull much more data than Twitter allows us to at one time. Between queries, it would be wonderful if there was a simple function to call that checks whether you need to wait.
Question:
What is a simple function for respecting Twitter's API rate limits, ensuring that any long-running query completes successfully without hammering Twitter and without getting the querying user banned?
Ideal Answer:
The ideal answer would be a portable function that works in all situations: that is, it finishes (properly) no matter what, and respects Twitter's API rate limit rules.
Caveat
I have submitted a working answer of my own but I am unsure if there is a way to improve it.
I am developing a Python package to use Twitter's new V2 API. I want to make sure that I am respecting Twitter's rate limits as best I can.
Below are the two functions used to wait when needed. They check the API response headers for the number of remaining queries and also rely on Twitter's HTTP status codes (provided here) as an ultimate backup. As far as I can tell, these three HTTP codes are the only time-related errors; the others should raise exceptions that tell the API user what they are doing incorrectly.
from datetime import datetime
from osometweet.utils import pause_until

def manage_rate_limits(response):
    """Manage Twitter V2 Rate Limits

    This method takes in a `requests` response object after querying
    Twitter and uses the headers["x-rate-limit-remaining"] and
    headers["x-rate-limit-reset"] headers objects to manage Twitter's
    most common, time-dependent HTTP errors.
    """
    while True:
        # Get number of requests left with our tokens
        remaining_requests = int(response.headers["x-rate-limit-remaining"])

        # If that number is one, we get the reset-time
        #   and wait until then, plus 15 seconds.
        #   The regular 429 exception is caught below as well,
        #   however, we want to program defensively, where possible.
        if remaining_requests == 1:
            buffer_wait_time = 15
            resume_time = datetime.fromtimestamp( int(response.headers["x-rate-limit-reset"]) + buffer_wait_time )
            print(f"One request from being rate limited. Waiting on Twitter.\n\tResume Time: {resume_time}")
            pause_until(resume_time)

        # Explicitly checking for time dependent errors.
        # Most of these errors can be solved simply by waiting
        # a little while and pinging Twitter again - so that's what we do.
        if response.status_code != 200:

            # Too many requests error
            if response.status_code == 429:
                buffer_wait_time = 15
                resume_time = datetime.fromtimestamp( int(response.headers["x-rate-limit-reset"]) + buffer_wait_time )
                print(f"Too many requests. Waiting on Twitter.\n\tResume Time: {resume_time}")
                pause_until(resume_time)

            # Twitter internal server error
            elif response.status_code == 500:
                # Twitter needs a break, so we wait 30 seconds
                resume_time = datetime.now().timestamp() + 30
                print(f"Internal server error @ Twitter. Giving Twitter a break...\n\tResume Time: {resume_time}")
                pause_until(resume_time)

            # Twitter service unavailable error
            elif response.status_code == 503:
                # Twitter needs a break, so we wait 30 seconds
                resume_time = datetime.now().timestamp() + 30
                print(f"Twitter service unavailable. Giving Twitter a break...\n\tResume Time: {resume_time}")
                pause_until(resume_time)

            else:
                # If we get this far, we've done something wrong and should exit
                raise Exception(
                    "Request returned an error: {} {}".format(
                        response.status_code, response.text
                    )
                )

        # Each time we get a 200 response, exit the function and return the response object
        if response.ok:
            return response
Here is the pause_until function.
import sys
import time as pytime
from datetime import datetime, timezone
from time import sleep

def pause_until(time):
    """ Pause your program until a specific end time. 'time' is either
    a valid datetime object or unix timestamp in seconds (i.e. seconds
    since Unix epoch) """
    end = time

    # Convert datetime to unix timestamp and adjust for locality
    if isinstance(time, datetime):
        # If we're on Python 3 and the user specified a timezone,
        # convert to UTC and get the timestamp.
        if sys.version_info[0] >= 3 and time.tzinfo:
            end = time.astimezone(timezone.utc).timestamp()
        else:
            zoneDiff = pytime.time() - (datetime.now() - datetime(1970, 1, 1)).total_seconds()
            end = (time - datetime(1970, 1, 1)).total_seconds() + zoneDiff

    # Type check
    if not isinstance(end, (int, float)):
        raise Exception('The time parameter is not a number or datetime object')

    # Now we wait
    while True:
        now = pytime.time()
        diff = end - now

        # Time is up!
        if diff <= 0:
            break
        else:
            # 'logarithmic' sleeping to minimize loop iterations
            sleep(diff / 2)
This seems to work quite nicely, but I'm not sure whether there are edge cases that will break it, or whether there is simply a more elegant/simple way to do this.
Related
(background)
I have an ERP application which is managed from a WebLogic console. Recently we noticed that the same activities we perform from the console can be performed using the vendor-provided REST API calls. So we wanted to use this approach programmatically and try to build some automation.
This is the page from which we can control one of the instances (console screenshot omitted).
The same button acts as both Stop and Start to manage the instance.
Start and stop each have their own API call, which makes sense.
The complete API doc is at : https://docs.oracle.com/cd/E61420_01/doc.92/e80710/smcrestapis.htm#BABFHBJI
(Now)
I wrote a program in Python using the requests library to call these APIs and it works fine.
The API response can take anywhere between 20 and 30 seconds when I use the stopInstance API, and normally 60 to 90 seconds when I use the startInstance API; but if there is an issue when starting the instance, it takes more than 300 seconds and goes into an indefinite wait.
My problem is that while starting an instance I want to wait a maximum of 100 seconds for the response. If it takes more than 100 seconds, the program should display a message like "Instance was not able to start in 100 seconds".
This is my program. I am taking input from a text file and all the values present there have been verified.
import requests
import json
import importlib.machinery
import importlib.util
import numpy
import time
import sys

loader = importlib.machinery.SourceFileLoader('SM', 'sm_details.txt')
spec = importlib.util.spec_from_loader(loader.name, loader)
mod = importlib.util.module_from_spec(spec)
loader.exec_module(mod)

username = str(mod.username)
password = str(mod.password)
hostname = str(mod.servermanagerHostname)
portnum = str(mod.servermanagerPort)
instanceDetails = numpy.array(mod.instanceName)

authenticationAPI = "http://"+hostname+":"+portnum+"/manage/mgmtrestservice/authenticate"
startInstanceAPI = "http://"+hostname+":"+portnum+"/manage/mgmtrestservice/startinstance"

headers = {
    'Content-Type': 'application/json',
    'Cache-Control': 'no-cache',
}

data = {}
data['username'] = username
data['password'] = password
instanceNameDict = {'instanceName': ''}

# Authentication request and storing token
response = requests.post(authenticationAPI, data=json.dumps(data), headers=headers)
token = response.headers['TOKEN']

head2 = {}
head2['TOKEN'] = token

def start(instance):
    print('\nTrying to start instance : ' + instance['instanceName'])
    startInstanceResponse = requests.post(startInstanceAPI, data=json.dumps(instance), headers=head2)  # this is where the program is stuck and it does not move to the time.sleep step
    time.sleep(100)
    if startInstanceResponse.status_code == 200:
        print('Instance ' + instance['instanceName'] + ' started.')
    else:
        print('Could not start instance in 100 seconds')
        sys.exit(1)
I would suggest using the timeout parameter in requests:
requests.post(startInstanceAPI,data=json.dumps(instance), headers=head2, timeout=100.0)
You can tell Requests to stop waiting for a response after a given
number of seconds with the timeout parameter. Nearly all production
code should use this parameter in nearly all requests. Failure to do
so can cause your program to hang indefinitely.
Source
Here is the requests timeout documentation; you will also find more details there, including exception handling.
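For illustration, here is a minimal sketch of how the timeout could be wired into the question's start() function to produce the desired message. startInstanceAPI and head2 are the variables already defined in the question's script, and the 100-second figure comes from the requirement above. Note that requests' timeout limits how long the client waits for the server to respond rather than the total wall-clock time, which is usually close enough for this use case:

import json
import sys

import requests

def start(instance):
    """Try to start an instance, giving up after roughly 100 seconds."""
    print('Trying to start instance: ' + instance['instanceName'])
    try:
        # Raises requests.exceptions.Timeout if no response arrives in time.
        response = requests.post(startInstanceAPI,
                                 data=json.dumps(instance),
                                 headers=head2,
                                 timeout=100.0)
    except requests.exceptions.Timeout:
        print('Instance was not able to start in 100 seconds')
        sys.exit(1)

    if response.status_code == 200:
        print('Instance ' + instance['instanceName'] + ' started.')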
Using Google Suite for Education.
I have an app that wants to:
Create a new calendar.
Add an ACL to such calendar, so the student role would be "reader".
Everything is run through a service account.
The calendar is created just fine, but inserting the ACL throws a 404 error (redacted for privacy):
<HttpError 404 when requesting https://www.googleapis.com/calendar/v3/calendars/MY_DOMAIN_long_string%40group.calendar.google.com/acl?alt=json returned "Not Found">
The function that tries to insert the ACL:
import googleapiclient.discovery
from google.oauth2 import service_account

def _create_calendar_acl(calendar_id, user, role='reader'):
    credentials = service_account.Credentials.from_service_account_file(
        CalendarAPI.module_path)
    scoped_credentials = credentials.with_scopes(
        ['https://www.googleapis.com/auth/calendar'])
    delegated_credentials = scoped_credentials.with_subject(
        'an_admin_email')
    calendar_api = googleapiclient.discovery.build('calendar',
                                                   'v3',
                                                   credentials=delegated_credentials)
    body = {'role': role,
            'scope': {'type': 'user',
                      'value': user}}
    answer = calendar_api.acl().insert(calendarId=calendar_id,
                                       body=body,
                                       ).execute()
    return answer
The funny thing is, if I retry the operation a couple of times, it eventually succeeds. Hence, that's what my code does:
import time

from googleapiclient.errors import HttpError

def create_student_schedule_calendar(email):
    MAX_RETRIES = 5
    # Get student information
    # Create calendar
    answer = Calendar.create_calendar('a.calendar.owner@mydomain',
                                      f'Student Name - schedule',
                                      timezone='Europe/Madrid')
    calendar_id = answer['id']

    counter = 0
    while counter < MAX_RETRIES:
        try:
            print('Try ' + str(counter + 1))
            _create_calendar_acl(calendar_id=calendar_id, user=email)  # This is where the 404 is thrown
            break
        except HttpError:  # this is where the 404 is caught
            counter += 1
            print('Wait ' + str(counter ** 2))
            time.sleep(counter ** 2)
            continue

    if counter == MAX_RETRIES:
        raise Exception(f'Exceeded retries to create ACL for {calendar_id}')
Anyway, it takes about four tries (between 14 and 30 seconds) to succeed, and sometimes it still fails.
Could it be that the recently created calendar is not immediately available to the API that uses it?
Propagation is often an issue with cloud-based services. Large-scale online services are distributed across a network of machines, each of which has some level of latency: there is a discrete, non-zero amount of time that information takes to propagate across the network and update everywhere.
The fact that everything works after the first call that doesn't return a 404 is indicative of this process.
Mitigation:
If you're creating and editing in the same function call, I suggest implementing some kind of wait/sleep for a moment to mitigate getting 404s. This can be done in Python using the time library:
import time
# calendar creation code here
time.sleep(2)
# calendar edit code here
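If a fixed sleep is not enough, a variant of the question's own retry loop is to back off exponentially and retry only on 404, re-raising anything else. This is just a sketch of that idea, with calendar_api, calendar_id and body taken from the question's function:

import time

from googleapiclient.errors import HttpError

def insert_acl_with_backoff(calendar_api, calendar_id, body, max_retries=5):
    """Retry the ACL insert while the newly created calendar propagates."""
    for attempt in range(max_retries):
        try:
            return calendar_api.acl().insert(calendarId=calendar_id,
                                             body=body).execute()
        except HttpError as err:
            # Only retry "Not Found"; other errors are real failures.
            if err.resp.status != 404:
                raise
            time.sleep(2 ** attempt)  # wait 1, 2, 4, 8, ... seconds
    raise RuntimeError('ACL insert still returned 404 after all retries')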
I have a large table in BigQuery which I have to go through, get all the data and process it in my GAE app. Since my table is going to be about 4M rows, I decided I have to get the data via the pagination mechanism implemented in the code examples here: https://cloud.google.com/bigquery/querying-data
import time
import uuid

from google.cloud import bigquery

def async_query(query):
    client = bigquery.Client()
    query_job = client.run_async_query(str(uuid.uuid4()), query)
    query_job.use_legacy_sql = False
    query_job.use_query_cache = False
    query_job.begin()
    wait_for_job(query_job)

    query_results = query_job.results()
    page_token = None
    output_rows = []
    while True:
        rows, total_rows, page_token = query_results.fetch_data(max_results=200, page_token=page_token)
        output_rows = output_rows + rows
        if not page_token:
            break

def wait_for_job(job):
    while True:
        job.reload()  # Refreshes the state via a GET request.
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)
But when I execute it, I receive an error:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
When the max_results parameter is greater than the table size, it works fine. When max_results is smaller than the table size and pagination is required, I get this error.
Am I missing something?
The error indicates that your overall request handler processing takes too long, very likely because of the multiple query_results.fetch_data iterations required by pagination.
You may want to check:
Dealing with DeadlineExceededErrors
Deadline errors: 60 seconds or less in Google App Engine
You'll probably have to re-think your app a bit; maybe try not to get the whole result immediately and instead either:
get just a portion of the result
get the result in a separate request, later on, after obtaining it in the background, for example:
via a single task queue request (10-minute deadline instead of 60 s), as sketched below
by assembling it from multiple chunks collected in separate task queue requests to really make it scalable (not sure if this actually works with BigQuery, I only tried it with the datastore)
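To illustrate the single-task-queue route, here is a minimal sketch assuming the first-generation App Engine standard environment, where the deferred library is available. The query string and the persistence step are placeholders, and async_query is the function from the question, assumed to be adjusted to return its output_rows:

from google.appengine.ext import deferred

def collect_rows_in_background(query):
    """Runs as a task queue task, which gets a 10-minute deadline
    instead of the 60-second limit on user-facing requests."""
    rows = async_query(query)  # the paginated fetch from the question,
                               # adjusted to return output_rows
    # ... persist `rows` (datastore, GCS, ...) for a later request to read back

# In the user-facing handler: enqueue the work and return immediately.
deferred.defer(collect_rows_in_background, "SELECT * FROM `my_dataset.my_table`")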
I am trying to validate the throttle limit for an endpoint using Python code.
Basically, the throttle limit I have set on the endpoint I am testing is 3 calls/sec. The test makes 4 calls and checks that the status codes include at least one 429 response.
The validation sometimes fails because it looks like the responses take more than a second to come back. The approaches I tried are:
Method1:
request = requests.Request(method='GET', url=GLOBALS["url"], params=context.payload, headers=context.headers)
context.upperlimit = int(GLOBALS["ThrottleLimit"]) + 1
reqs = [request for i in range(0, context.upperlimit)]
with BaseThrottler(name='base-throttler', reqs_over_time=(context.upperlimit, 1)) as bt:
    throttled_requests = bt.multi_submit(reqs)
context.responses = [tr.response for tr in throttled_requests]
assert(429 in [i.status_code for i in context.responses])
Method2:
request = requests.get(url=GLOBALS["url"], params=context.payload, headers=context.headers)
url = request.url
urls = set([])
for i in range(0, context.upperlimit):
    urls.add(grequests.get(url))
context.responses = grequests.map(urls)
assert(429 in [i.status_code for i in context.responses])
Is there a way I can make sure all the responses came back within the same second and, if not, try again before failing the test?
I suppose you are using the requests and grequests libraries. You can set a timeout as explained in the docs, and likewise for grequests.
Plain requests
requests.get(url, timeout=1)
Using grequests
grequests.get(url, timeout=1)
The timeout value is a number of seconds.
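For what it's worth, here is a minimal sketch of how a timed-out request surfaces with grequests: a request that exceeds the timeout comes back as None in the mapped results, and an exception_handler can be supplied to log the failure. The url variable is assumed to come from the question's snippet:

import grequests

def on_exception(request, exception):
    # Called for requests that fail, e.g. because the 1-second timeout expired.
    print('Request failed:', exception)

reqs = [grequests.get(url, timeout=1) for _ in range(4)]
responses = grequests.map(reqs, exception_handler=on_exception)

# Timed-out requests appear as None, so filter them out before asserting.
status_codes = [r.status_code for r in responses if r is not None]
assert 429 in status_codes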
Using timeout won't necessarily ensure the condition that you are looking for, which is that all 4 requests were received by the endpoint within one second (not that each individual response was received within one second of sending the request).
One quick and dirty way to solve this is to simply time the execution of the code, and ensure that all responses were received in less than a second (using the timeit module)
import timeit

start_time = timeit.default_timer()
context.responses = grequests.map(urls)
elapsed = timeit.default_timer() - start_time
if elapsed < 1:
    assert(429 in [i.status_code for i in context.responses])
This is crude because it checks round-trip time, but it will ensure that all requests were received within a second. If you need more specificity, or find that the condition is not met often enough, you could add a header to the response with the exact time the request was received by the endpoint, and then verify that all requests hit the endpoint within one second of each other.
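If the endpoint (or a gateway in front of it) reports when it received each request, the stricter check could look roughly like the sketch below. Here the standard Date response header stands in for a custom, higher-resolution "received-at" header the endpoint might expose, and context and urls are the objects from the question's snippet:

from email.utils import parsedate_to_datetime

import grequests

def all_received_within_one_second(responses):
    """True if the server-side receive times span less than one second."""
    times = [parsedate_to_datetime(r.headers['Date']) for r in responses]
    return (max(times) - min(times)).total_seconds() < 1

context.responses = grequests.map(urls)
if all_received_within_one_second(context.responses):
    assert 429 in [r.status_code for r in context.responses]
else:
    # The batch was spread over more than a second; retry it rather than
    # failing the throttling test outright.
    pass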
I am trying to collect retweet data from the Chinese microblog Sina Weibo; you can see the code below. However, I am running into the problem of the IP request limit.
To work around it, I have to add time.sleep() to the code. You can see I attempted to add the line 'time.sleep(10) # to suppress the ip request limit', so that Python sleeps 10 seconds after crawling a page of retweets (one page contains 200 retweets).
However, it is still not sufficient to deal with the IP problem.
So I am planning to make Python sleep, more systematically, for 60 seconds after it has crawled every 20 pages. Your ideas will be appreciated.
ids = [3388154704688495, 3388154704688494, 3388154704688492]

addressForSavingData = "C:/Python27/weibo/Weibo_repost/repostOwsSave1.csv"
file = open(addressForSavingData, 'wb')  # save to csv file

for id in ids:
    if api.rate_limit_status().remaining_hits >= 205:
        for object in api.counts(ids=id):
            repost_count = object.__getattribute__('rt')
            print id, repost_count
            pages = repost_count / 200 + 2  # why should it be 2? cuz python starts from 0
            for page in range(1, pages):
                time.sleep(10)  # to suppress the ip request limit
                for object in api.repost_timeline(id=id, count=200, page=page):  # get the repost_timeline of a weibo
                    """1.1 reposts"""
                    mid = object.__getattribute__("id")
                    text = object.__getattribute__("text").encode('gb18030')  # add encode here
                    """1.2 reposts.user"""
                    user = object.__getattribute__("user")  # for object in user
                    user_id = user.id
                    """2.1 retweeted_status"""
                    rts = object.__getattribute__("retweeted_status")
                    rts_mid = rts.id  # the id of weibo
                    """2.2 retweeted_status.user"""
                    rtsuser_id = rts.user[u'id']
                    try:
                        w = csv.writer(file, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                        w.writerow((mid,
                                    user_id, rts_mid,
                                    rtsuser_id, text))  # write it out
                    except:  # Exception of UnicodeEncodeError
                        pass
    elif api.rate_limit_status().remaining_hits < 205:
        sleep_time = api.rate_limit_status().reset_time_in_seconds  # time.time()
        print sleep_time, api.rate_limit_status().reset_time
        time.sleep(sleep_time + 2)

file.close()
pass
Can you not just pace the script instead?
I suggest making your script sleep between each request instead of making all the requests at the same time, spreading them over, say, a minute. This way you will also avoid any flooding bans, and it is considered good behaviour.
Pacing your requests may also allow you to do things more quickly if the server does not time you out for sending too many requests.
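As a rough sketch of that pacing idea applied to the question's loop (the three-second spacing is an arbitrary example, and process_repost stands in for the CSV-writing code above):

import time

SECONDS_BETWEEN_REQUESTS = 3  # arbitrary example spacing

for page in range(1, pages):
    for obj in api.repost_timeline(id=id, count=200, page=page):
        process_repost(obj)  # placeholder for the CSV-writing code above
    time.sleep(SECONDS_BETWEEN_REQUESTS)  # pause after every request, not every 20 pages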
If there is a limit on the IP, sometimes there are no great and easy solutions. For example, if you run Apache, http://opensource.adnovum.ch/mod_qos/ limits bandwidth and connections; specifically, it limits:
The maximum number of concurrent requests
Limitation of the bandwidth such as the maximum allowed number of requests per second to an URL or the maximum/minimum of downloaded kbytes per second.
Limits the number of request events per second
Generic request line and header filter to deny unauthorized operations.
Request body data limitation and filtering
the maximum number of allowed connections from a single IP source address or dynamic keep-alive control.
You may want to start with these. You could send referrer URLs with your requests and make only single connections, not multiple connections.
You could also refer to this question
I figured out the solution:
First, initialize an integer, e.g. 0:
i = 0
Second, inside the for page loop, add the following code:
for page in range(1, 300):
    i += 1
    if (i % 25 == 0):
        print i, "find i which could be exactly divided by 25"
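Putting the counter together with the pause it is meant to trigger, a sketch of the complete loop (keeping the question's Python 2 style, and using a 60-second pause every 25 pages as described above) might look like:

import time

i = 0
for page in range(1, 300):
    i += 1
    # ... crawl one page of retweets here ...
    if (i % 25 == 0):
        print i, "pages crawled; taking a longer break"
        time.sleep(60)  # extra pause to stay under the IP request limit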