Scenario: I need to gather paginated data from a web app's API, which has a call limit of 100 per minute. The API returns 100 items per page across 105 (and growing) pages, for ~10,500 total items. Synchronous code was taking approximately 15 minutes to retrieve all the pages, so there was no worry about hitting the call limit then. However, I wanted to speed up the data retrieval, so I implemented asynchronous calls using asyncio and aiohttp. Data now downloads in 15 seconds - nice.
Problem: I'm now hitting the call limit and thus receiving 403 errors for the last 5 or so calls.
Proposed Solution: I implemented the try/except found in the get_data() function. I make the calls, and when a call fails with 403: Exceeded call limit, I back off for back_off seconds and retry up to retries times:
async def get_data(session, url):
retries = 3
back_off = 60 # seconds to try again
for _ in range(retries):
try:
async with session.get(url, headers=headers) as response:
if response.status != 200:
response.raise_for_status()
print(retries, response.status, url)
return await response.json()
except aiohttp.client_exceptions.ClientResponseError as e:
retries -= 1
await asyncio.sleep(back_off)
continue
async def main():
async with aiohttp.ClientSession() as session:
attendee_urls = get_urls('attendee') # returns list of URLs to call asynchronously in get_data()
attendee_data = await asyncio.gather(*[get_data(session, attendee_url) for attendee_url in attendee_urls])
return attendee_data
if __name__ == '__main__':
data = asyncio.run(main())
Question: How do I limit the aiohttp calls so that they stay under the 100 calls/minute threshold without making a 403 request to back off? I've tried the following modules and none of them appeared to do anything: ratelimiter, ratelimit and asyncio-throttle.
Goal: To make 100 async calls per minute, but backing off and retrying if necessary (403: Exceeded call limit).
You can achieve "at most 100 requests/min" by adding a delay before every request.
100 requests/min is equivalent to 1 request/0.6s.
async def main():
    async with aiohttp.ClientSession() as session:
        attendee_urls = get_urls('attendee') # returns list of URLs to call asynchronously in get_data()
        tasks = []
        for attendee_url in attendee_urls:
            # create_task starts each request the next time the loop yields, so the
            # sleep below staggers the request starts ~0.6 s apart; bare coroutine
            # objects would not run until gather() and would all fire at once
            tasks.append(asyncio.create_task(get_data(session, attendee_url)))
            await asyncio.sleep(0.6)
        attendee_data = await asyncio.gather(*tasks)
        return attendee_data
Apart from the request rate limit, APIs often also limit the number of simultaneous requests. If so, you can use a BoundedSemaphore.
async def main():
    sema = asyncio.BoundedSemaphore(50) # Assuming a concurrent requests limit of 50
    ...
        tasks.append(asyncio.create_task(get_data(sema, session, attendee_url)))
    ...

async def get_data(sema, session, url):
    ...
    for _ in range(retries):
        try:
            async with sema:
                async with session.get(url, headers=headers) as response:
                    if response.status != 200:
                        response.raise_for_status()
                    ...
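Putting the two ideas together, here is a minimal, self-contained sketch of a rate-limited, concurrency-limited fetcher. The URL list, page count, and the 50-connection limit are illustrative assumptions, not values taken from the API in question:

import asyncio
import aiohttp

RATE_DELAY = 0.6     # 60 s / 100 calls -> at most ~100 request starts per minute
MAX_CONCURRENT = 50  # assumed concurrent-request limit

async def get_data(sema, session, url, retries=3, back_off=60):
    for _ in range(retries):
        try:
            async with sema:
                async with session.get(url) as response:
                    response.raise_for_status()
                    return await response.json()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(back_off)  # hit the limit (403/429): back off, then retry
    return None  # all retries exhausted

async def main(urls):
    sema = asyncio.BoundedSemaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(asyncio.create_task(get_data(sema, session, url)))
            await asyncio.sleep(RATE_DELAY)  # stagger request starts
        return await asyncio.gather(*tasks)

if __name__ == '__main__':
    urls = [f'https://api.example.com/attendees?page={n}' for n in range(105)]  # placeholder URLs
    data = asyncio.run(main(urls))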
Related
I am working with an API that limits to 4 requests/s. I am using asyncio and aiohttp to make asynchronous http requests. I am using Windows.
When working with the API I receive 3 status codes most commonly, 200, 400 & 429.
The 4/s limit works entirely fine when I'm seeing many 400s, but as soon as I receive a 200 and try to resolve the JSON using response.json, I begin to receive 429s for too many requests. I am trying to understand why something like this would occur.
EDIT: After doing some logging, I am seeing that the 429s appear to creep up after I have a response that takes longer than a second to complete (in the case of a 200, a large JSON response might take a bit of time to resolve). It appears that after a request with > 1 s elapsed time, followed by a fast one (which takes a few ms), the requests sort of "jump" ahead too quickly and overwhelm the API with more than 4 requests resolving in a second.
I am utilizing a semaphore of size 3 (trying 4 hits 429s far more often). The workflow is generally:
1. Create event loop
2. Gather tasks
3. Create http session and begin async requests with our Semaphore.
4. _fetch() is handling the specific asynchronous requests.
I am trying to understand why this happens when I receive 200s (which require some JSON serialization and likely add some latency). If I am always awaiting a sleep call of 1.5 seconds per call, why am I still able to hit rate limits? Is this the fault of the API I am hitting, or is there something intrinsically wrong with my async/await calls?
Below is my code:
import asyncio
import aiohttp
import time
class Request:
def __init__(self, url: str, method: str="get", payload: str=None):
self.url: str = url
self.method: str = method
self.payload: str or dict = payload or dict()
class Response:
def __init__(self, url: str, status: int, payload: dict=None, error: bool=False, text: str=None):
self.url: str = url
self.status: int = status
self.payload: dict = payload or dict()
self.error: bool = error
self.text: str = text or ''
def make_requests(headers: dict, requests: "list[Request]") -> "list[Response]":
"""
requests is a list with data necessary to make requests
"""
loop: asyncio.AbstractEventLoop = asyncio.get_event_loop()
    responses: "list[Response]" = loop.run_until_complete(_run(headers, requests))
return responses
async def _run(headers: dict, requests: list[Request]) -> "list[Response]":
    # Create a semaphore to limit how many concurrent requests we can run (MAXIMUM) at a time.
semaphore: asyncio.Semaphore = asyncio.Semaphore(3)
time.sleep(10) # wait 10 seconds before beginning our async requests
async with aiohttp.ClientSession(headers=headers) as session:
tasks: list[asyncio.Task] = [asyncio.create_task(_iterate(semaphore, session, request)) for request in requests]
responses: list[Response] = await asyncio.gather(*tasks)
return responses
async def _iterate(semaphore: asyncio.Semaphore, session: aiohttp.ClientSession, request: Request) -> Response:
async with semaphore:
return await _fetch(session, request)
async def _fetch(session: aiohttp.ClientSession, request: Request) -> Response:
try:
async with session.request(request.method, request.url, params=request.payload) as response:
print(f"NOW: {time.time()}")
print(f"Response Status: {response.status}.")
content: dict = await response.json()
response.raise_for_status()
await asyncio.sleep(1.5)
return Response(request.url, response.status, payload=content, error=False)
except aiohttp.ClientResponseError:
if response.status == 429:
await asyncio.sleep(12) # Back off before proceeding with more requests
return await _fetch(session, request)
else:
await asyncio.sleep(1.5)
return Response(request.url, response.status, error=True)
The 4/s limit works entirely fine when I'm seeing many 400s, but as soon as I receive a 200 and try to resolve the JSON using response.json, I begin to receive 429s for too many requests. I am trying to understand why something like this would occur.
The response status does not depend on how often you call the .json method on responses. The cause is more likely throttling or other protection on the server the API is running on. While debugging, I also reworked make_requests to make it more readable:
import asyncio
import aiohttp
class Request:
def __init__(self, url: str, method: str = "get", payload: str = None):
self.url: str = url
self.method: str = method
self.payload: str or dict = payload or dict()
class Response:
def __init__(self, url: str, status: int, payload: dict = None, error: bool = False, text: str = None):
self.url: str = url
self.status: int = status
self.payload: dict = payload or dict()
self.error: bool = error
self.text: str = text or ''
async def make_requests(headers: dict, requests: "list[Request]"):
"""
This function makes concurrent requests with a semaphore.
:param headers: Main HTTP headers to use in the session.
:param requests: A list of Request objects.
:return: List of responses converted to Response objects.
"""
async def make_request(request: Request) -> Response:
"""
This closure makes limited requests at the time.
:param request: An instance of Request that describes HTTP request.
:return: A processed response.
"""
async with semaphore:
try:
response = await session.request(request.method, request.url, params=request.payload)
content = await response.json()
response.raise_for_status()
return Response(request.url, response.status, payload=content, error=False)
            except (aiohttp.ClientResponseError, aiohttp.ContentTypeError, aiohttp.ClientError):
                if response.status == 429:
                    await asyncio.sleep(12)  # back off before retrying, as in the original code
                    return await make_request(request)
                return Response(request.url, response.status, error=True)
semaphore = asyncio.Semaphore(3)
curr_loop = asyncio.get_running_loop()
async with aiohttp.ClientSession(headers=headers) as session:
return await asyncio.gather(*[curr_loop.create_task(make_request(request)) for request in requests])
if __name__ == "__main__":
HEADERS = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
REQUESTS = [
Request("https://www.google.com/search?q=query1"),
Request("https://www.google.com/search?q=query2"),
Request("https://www.google.com/search?q=query3"),
Request("https://www.google.com/search?q=query4"),
Request("https://www.google.com/search?q=query5"),
]
loop = asyncio.get_event_loop()
responses = loop.run_until_complete(make_requests(HEADERS, REQUESTS))
print(responses) # [<__main__.Response object at 0x7f4f73e5da30>, <__main__.Response object at 0x7f4f73e734f0>, <__main__.Response object at 0x7f4f73e73790>, <__main__.Response object at 0x7f4f73e5d9d0>, <__main__.Response object at 0x7f4f73e73490>]
loop.close()
If you get 400s after some number of requests, check which headers the browser sends that are missing from your request.
I am trying to understand why this happens when I receive 200s (which require some JSON serialization and likely add some latency). If I am always awaiting a sleep call of 1.5 seconds per call, why am I still able to hit rate limits? Is this the fault of the API I am hitting, or is there something intrinsically wrong with my async/await calls?
I'm not sure what you meant by "able to hit rate limits", but asyncio.sleep should work properly. The script makes the first limited number of concurrent requests (in this case, the semaphore allows three concurrent tasks) almost at the same time. After a response is received, each task waits 1.5 sec concurrently and returns its result. The key is concurrency: if you wait with asyncio.sleep for 1.5 sec in 3 different tasks, the total wait is 1.5 sec, not 4.5. If you want to set delays between requests, wait before or after calling create_task.
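For example, here is a minimal sketch of spacing out request starts for a 4 requests/second limit; the URL list and the fetch body are illustrative, not taken from the question:

import asyncio
import aiohttp

API_URLS = ['https://httpbin.org/get'] * 12  # placeholder URLs

async def fetch(session, url):
    async with session.get(url) as response:
        return response.status

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in API_URLS:
            tasks.append(asyncio.create_task(fetch(session, url)))
            await asyncio.sleep(0.25)  # one request start every 0.25 s -> at most 4/s
        print(await asyncio.gather(*tasks))

asyncio.run(main())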
There's a game named Guild Wars 2, and it gives us APIs to query almost everything in the game database. My aim is to use Python asyncio and aiohttp to write a simple crawler and get all the items' info from the Guild Wars 2 game database.
I wrote a short program. It works, but it behaves kind of strangely, and I guess there's something I don't understand about composing coroutines.
First, I made a request with the Postman app, and in the response headers there is X-Rate-Limit-Limit: 600. So I guess requests are limited to 600 per minute?
Here are my questions:
1. After the program finished, I checked some of the JSON files, and they have the same content:
[{"name": "Endless Fractal Challenge Mote Tonic", "description": "Transform into a Challenge Mote for 15 minutes or until hit. You cannot move while transformed."......
which means those requests got a bad response, but I don't know why.
2. I tried asyncio.Semaphore, but even if I limit concurrency to 5, the request rate goes beyond 600 per minute very soon. So I tried to control timing by adding a time.sleep(0.2) at the end of the request_item function. I guess time.sleep(0.2) will suspend the whole Python process for 0.2 seconds, and it actually worked, but after executing for some time the program hangs for a long time and then gives out a lot of failed attempts. Every automatic retry still failed. I'm confused by this behavior.
async def request_item(session, item_id):
req_param_item = req_param
req_param_item['ids'] = item_id
# retry for 3 times when exception occurs.
for i in range(3):
try:
async with session.get(url_template, params=req_param_item) as response:
result = await response.json()
with open(f'item_info/{item_id}.json', 'w') as f:
json.dump(result, f)
print(item_id, 'done')
break
except Exception as e:
print(item_id, i, 'failed')
continue
time.sleep(0.2)
When I move time.sleep(0.2) into the for loop inside the request_item function, the whole program hangs. I have no idea what is happening.
async def request_item(session, item_id):
req_param_item = req_param
req_param_item['ids'] = item_id
for i in range(3):
try:
time.sleep(0.2)
async with session.get(url_template, params=req_param_item) as response:
result = await response.json()
with open(f'item_info/{item_id}.json', 'w') as f:
json.dump(result, f)
print(item_id, 'done')
break
except Exception as e:
print(item_id, i, 'failed')
continue
Could anyone explain this a little? And is there a better solution? I have thought of some possible solutions, but I can't test them: for example, get loop.time() and suspend the whole event loop after every 600 requests; or add 600 requests to task_list, gather them as a group, and after it's done call asyncio.run(get_item(req_ids)) again with another 600 requests.
Here's all of my code:
import aiohttp
import asyncio
import httpx
import json
import math
import os
import time
tk = 'xxxxxxxx'
url_template = 'https://api.guildwars2.com/v2/items'
# get items list
req_param = {'access_token': tk}
item_list_resp = httpx.get(url_template, params=req_param)
items = item_list_resp.json()
async def request_item(session, item_id):
req_param_item = req_param
req_param_item['ids'] = item_id
for i in range(3):
try:
async with session.get(url_template, params=req_param_item) as response:
result = await response.json()
with open(f'item_info/{item_id}.json', 'w') as f:
json.dump(result, f)
print(item_id, 'done')
break
except Exception as e:
print(item_id, i, 'failed')
continue
# since the game API limit requests, I think it's ok to suspend program for a while
time.sleep(0.2)
async def get_item(item_ids: list):
task_list = []
async with aiohttp.ClientSession() as session:
for item_id in item_ids:
req = request_item(session, item_id)
task = asyncio.create_task(req)
task_list.append(task)
await asyncio.gather(*task_list)
asyncio.run(get_item(req_ids))
You are using time.sleep() instead of await asyncio.sleep(). It blocks the whole execution for N seconds, and it does so in the wrong place.
Here is what happens.
When you run
for item_id in item_ids:
req = request_item(session, item_id)
task = asyncio.create_task(req)
task_list.append(task)
You just schedule your requests, but you do not run them yet. Say you have 1000 item_ids: you schedule 1000 tasks, and when you run await asyncio.gather(*task_list) you actually wait for all 1000 of these tasks to be executed. They fire at once.
But inside each task you run time.sleep(0.2), so you have to wait 1000 * 0.2 seconds in total. Remember that all tasks run in a single thread and, in general, in an arbitrary order: task 1 fires and blocks everything for 0.2 sec, then task 2 fires and blocks for 0.2 sec, then task 999 fires and blocks for 0.2 sec, and so on.
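To see the difference between the two sleeps, here is a small self-contained illustration (the task bodies and the timings are made up purely for the demonstration):

import asyncio
import time

async def blocking_task(n):
    time.sleep(0.2)            # blocks the whole event loop for 0.2 s
    return n

async def cooperative_task(n):
    await asyncio.sleep(0.2)   # yields to the event loop; other tasks keep running
    return n

async def main():
    start = time.perf_counter()
    await asyncio.gather(*[blocking_task(i) for i in range(10)])
    print(f'time.sleep:    {time.perf_counter() - start:.1f}s')   # ~2.0 s
    start = time.perf_counter()
    await asyncio.gather(*[cooperative_task(i) for i in range(10)])
    print(f'asyncio.sleep: {time.perf_counter() - start:.1f}s')   # ~0.2 s

asyncio.run(main())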
The simplest solution is to wait for a minute after firing 600 requests; you need to slow down inside get_item. Example code (I have not tested it):
async def get_item(item_ids: list):
    task_list = []
    async with aiohttp.ClientSession() as session:
        for n, item_id in enumerate(item_ids):
            req = request_item(session, item_id)
            task = asyncio.create_task(req)
            task_list.append(task)
            if (n + 1) % 600 == 0:
                # fire the batch of 600, wait for it, then pause for a minute
                await asyncio.gather(*task_list)
                await asyncio.sleep(60)
                task_list = []
        if task_list:
            # wait for the final, smaller batch
            await asyncio.gather(*task_list)
I recommend using the asyncio-throttle library.
P.S. With a rate limit of 600 per minute, I am not sure you need asyncio at all, because 600 concurrent requests will probably be executed in 5-10 seconds. Double-check whether your 600 requests really take more than 1 minute with classic requests plus threads.
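For reference, a rough sketch of how asyncio-throttle is typically used with aiohttp; the Throttler class and its rate_limit/period arguments are based on my recollection of that library's README, so verify them against the current documentation:

import asyncio
import aiohttp
from asyncio_throttle import Throttler  # third-party: pip install asyncio-throttle

async def request_item(throttler, session, item_id):
    async with throttler:  # waits until a request slot is free
        async with session.get('https://api.guildwars2.com/v2/items', params={'ids': item_id}) as response:
            return await response.json()

async def main(item_ids):
    throttler = Throttler(rate_limit=600, period=60)  # at most 600 requests per 60 seconds
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[request_item(throttler, session, i) for i in item_ids])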
I'm writing an API in Flask that makes 1000+ requests to fetch data, and I'd like to limit the number of requests per second. I tried:
conn = aiohttp.TCPConnector(limit_per_host=20)
and
conn = aiohttp.TCPConnector(limit=20)
But it doesn't seem to work.
My code looks like this:
import logging
import asyncio
import aiohttp
logging.basicConfig(filename="logfilename.log", level=logging.INFO, format='%(asctime)s %(levelname)s:%(message)s')
async def fetch(session, url):
async with session.get(url, headers=headers) as response:
if response.status == 200:
data = await response.json()
json = data['args']
return json
async def fetch_all(urls, loop):
conn = aiohttp.TCPConnector(limit=20)
async with aiohttp.ClientSession(connector=conn, loop=loop) as session:
results = await asyncio.gather(*[fetch(session, url) for url in urls], return_exceptions=True)
return results
async def main():
loop = asyncio.new_event_loop()
url_list = []
args = ['a', 'b', 'c', +1000 others]
urls = url_list
for i in args:
base_url = 'http://httpbin.org/anything?key=%s' % i
url_list.append(base_url)
htmls = loop.run_until_complete(fetch_all(urls, loop))
for j in htmls:
key = j['key']
# save to database
logging.info(' %s was added', key)
If I run the code, I send more than 200 requests within 1 s. Is there any way to limit the requests?
The code above works as expected (apart from a small error regarding headers being undefined).
Tested on my machine, the httpbin URL responds in around 100 ms, which means that with a concurrency of 20 it will serve around 200 requests in 1 second (which is what you're seeing as well):
100 ms per request means 10 requests are completed in a second
10 requests per second with a concurrency of 20 means 200 requests in one second
The limit option (aiohttp.TCPConnector) limits the number of concurrent requests and does not have any time dimension.
To see the limit in action try with more values like 10, 20, 50:
# time to complete 1000 requests with different keys
aiohttp.TCPConnector(limit=10): 12.58 seconds
aiohttp.TCPConnector(limit=20): 6.57 seconds
aiohttp.TCPConnector(limit=50): 3.1 seconds
If you want a requests-per-second limit, send a batch of requests (20, for example), use asyncio.sleep(1.0) to pause for a second, then send the next batch, and so on.
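A rough sketch of that batch-and-sleep approach, reusing a fetch coroutine like the one in the question (headers omitted, since they were undefined there, and placeholder keys):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        if response.status == 200:
            data = await response.json()
            return data['args']

async def fetch_all(urls, batch_size=20):
    results = []
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(urls), batch_size):
            batch = urls[i:i + batch_size]
            results += await asyncio.gather(*[fetch(session, url) for url in batch])
            await asyncio.sleep(1.0)  # pause so we send at most ~batch_size requests per second
    return results

urls = ['http://httpbin.org/anything?key=%s' % i for i in ['a', 'b', 'c']]
results = asyncio.run(fetch_all(urls))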
I am writing a get method that receives an array of ids and then makes a request for each id. The array can potentially contain 500+ ids, and right now the requests take 20+ minutes. I have tried several different async approaches like aiohttp and asyncio, and neither of them has made the requests faster. Here is my code:
async def get(self):
self.set_header("Access-Control-Allow-Origin", "*")
story_list = []
duplicates = []
loop = asyncio.get_event_loop()
ids = loop.run_in_executor(None, requests.get, 'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
response = await ids
response_data = response.json()
print(response.text)
for url in response_data:
if url not in duplicates:
duplicates.append(url)
stories = loop.run_in_executor(None, requests.get, "https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(
url))
data = await stories
if data.status_code == 200 and len(data.text) > 5:
print(data.status_code)
print(data.text)
story_list.append(data.json())
Is there a way I can use multithreading to make the requests faster?
The main issue here is that the code isn't really async.
After getting your list of URLs, you are fetching them one at a time and awaiting each response.
A better idea would be to filter out the duplicates (use a set) before queuing all of the URLs in the executor and awaiting all of them to finish, e.g.:
async def get(self):
self.set_header("Access-Control-Allow-Origin", "*")
stories = []
loop = asyncio.get_event_loop()
# Single executor to share resources
    executor = ThreadPoolExecutor()  # requires: from concurrent.futures import ThreadPoolExecutor
# Get the initial set of ids
response = await loop.run_in_executor(executor, requests.get, 'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
response_data = response.json()
print(response.text)
# Putting them in a set will remove duplicates
urls = set(response_data)
# Build the set of futures (returned by run_in_executor) and wait for them all to complete
responses = await asyncio.gather(*[
loop.run_in_executor(
executor, requests.get,
"https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(url)
) for url in urls
])
# Process the responses
for response in responses:
if response.status_code == 200 and len(response.text) > 5:
print(response.status_code)
print(response.text)
stories.append(response.json())
return stories
I'm trying to make API calls with Python asynchronously. I have multiple endpoints in a list, and each endpoint will return paginated results. I'm able to set up going through the multiple endpoints asynchronously; however, I am not able to return the paginated results of each endpoint.
From debugging, I found that the fetch_more() function runs the while loop but doesn't actually get past the async with session.get(). So basically, the fetch_more() function is intended to get the remaining results from the API call for each endpoint, yet I find that the same number of results is returned with or without it. I've tried looking for examples of pagination with asyncio but have not had much luck.
From my understanding, I should not be making a request inside a while loop; however, I'm not sure of a way around that in order to get paginated results.
if __name__ == '__main__':
    starter_func(url, header, endpoints)

def starter_func(url, header, endpoints):
loop = asyncio.get_event_loop() #event loop
future = asyncio.ensure_future(fetch_all(url, header, endpoints))
loop.run_until_complete(future) #loop until done
async def fetch_all(url, header, endpoints):
    async with ClientSession() as session:
        tasks = []
        for endpoint in endpoints:
            task = asyncio.ensure_future(fetch(session, url, header, endpoint))
            tasks.append(task)
        res = await asyncio.gather(*tasks)  # gather task responses
        return res

async def fetch(session, url, header, endpoint):
total_tasks = []
async with session.get(url, headers=header, params=params, ssl=False) as response:
response_json = await response.json()
data = response_json['key']
        tasks = asyncio.ensure_future(fetch_more(response_json, data, params, header, url, endpoint, session))  # this is where I am getting stuck
total_tasks.append(tasks)
return data
# function to get the paginated results of an API endpoint
async def fetch_more(response_json, data, params, header, url, endpoint, session):  # this is where I am getting stuck
while len(response_json['key']) >= params['limit']:
params['offset'] = response_json['offset'] + len(response_json['key'])
async with session.get(url, headers=header, params=params, ssl=False) as response_continued:
resp_continued_json = await response_continued.json()
data.extend(resp_continued_json[kebab_to_camel(endpoint)])
return data
Currently I am getting 1000 results with or without the fetch_more function; however, it should be a lot more with fetch_more. Any idea how to approach paginating asynchronously?
from aiohttp import web
async def fetch(self, size: int = 10):
data = "some code to fetch data here"
def paginate(_data, _size):
import itertools
while True:
i1, i2 = itertools.tee(_data)
_data, page = (itertools.islice(i1, _size, None),
list(itertools.islice(i2, _size)))
if len(page) == 0:
break
yield page
return web.json_response(list(paginate(_data=data, _size=size)))
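For what it's worth, the paginate generator above can be exercised on its own; here is a small standalone demo (the generator is copied out of the handler with the import moved to the top, and the input range is just example data):

import itertools

def paginate(_data, _size):
    while True:
        i1, i2 = itertools.tee(_data)
        _data, page = (itertools.islice(i1, _size, None),
                       list(itertools.islice(i2, _size)))
        if len(page) == 0:
            break
        yield page

print(list(paginate(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]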