The Getting Started docs for aiohttp give the following client example:
async with aiohttp.ClientSession() as session:
async with session.get('https://api.github.com/events') as resp:
print(resp.status)
print(await resp.text())
I'm having trouble understanding when the response.status will be available. My understanding is that the coroutines releases control at the await response.read() line. How can I possibly access status before waiting for the response to comeback?
Important distinction: await ... may release control of the context, for example if the awaited data is not avalible fast enough. The same goes for the async with ... statement. Therefore your code reaches the line print(resp.status) not until the resp is avalible.
For example the code:
import aiohttp
import asyncio
import urllib.parse
import datetime
async def get(session, url):
print("[{:%M:%S.%f}] getting {} ...".format(datetime.datetime.now(), urllib.parse.urlsplit(url).hostname))
async with session.get(url) as resp:
print("[{:%M:%S.%f}] {}, status: {}".format(datetime.datetime.now(), urllib.parse.urlsplit(url).hostname, resp.status))
doc = await resp.text()
print("[{:%M:%S.%f}] {}, len: {}".format(datetime.datetime.now(), urllib.parse.urlsplit(url).hostname, len(doc)))
async def main():
session = aiohttp.ClientSession()
url = "http://demo.borland.com/Testsite/stadyn_largepagewithimages.html"
f1 = asyncio.ensure_future(get(session, url))
print("[{:%M:%S.%f}] added {} to event loop".format(datetime.datetime.now(), urllib.parse.urlsplit(url).hostname))
url = "https://stackoverflow.com/questions/46445019/aiohttp-when-is-the-response-status-available"
f2 = asyncio.ensure_future(get(session, url))
print("[{:%M:%S.%f}] added {} to event loop".format(datetime.datetime.now(), urllib.parse.urlsplit(url).hostname))
url = "https://api.github.com/events"
f3 = asyncio.ensure_future(get(session, url))
print("[{:%M:%S.%f}] added {} to event loop".format(datetime.datetime.now(), urllib.parse.urlsplit(url).hostname))
await f1
await f2
await f3
session.close()
if __name__ == '__main__':
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
can produce this result:
[16:42.415481] added demo.borland.com to event loop
[16:42.415481] added stackoverflow.com to event loop
[16:42.415481] added api.github.com to event loop
[16:42.415481] getting demo.borland.com ...
[16:42.422481] getting stackoverflow.com ...
[16:42.682496] getting api.github.com ...
[16:43.002515] demo.borland.com, status: 200
[16:43.510544] stackoverflow.com, status: 200
[16:43.759558] stackoverflow.com, len: 110650
[16:43.883565] demo.borland.com, len: 239012
[16:44.089577] api.github.com, status: 200
[16:44.318590] api.github.com, len: 43055
Clarification (thx #deceze): Here you can see (look at the times between the brackets) all coroutines releasing control after sending a request to retrieve the website and a second time while awaiting the text of the response. Also borland, in contrast to stackoverflow, has so much text (other network characteristics excluded) that it's only ready to be displayed after the text from stackoverflow was printed, despite being requested earlier.
You first get the HTTP response headers, which include in the first line the status code. If you so choose you can then read the rest of the response body (here with resp.text()). Since the headers are always relatively small and the body may be very large, aiohttp gives you the chance to read both separately.
resp object is available inside async with block. Therefore resp.status is available too. Also you can call await on some methods, like resp.text() but is doesn't release control of async with block. You can work with resp even after await has been called.
Related
I'm working with asyncio and aiohttp to call an API many times. While can print the responses, I want to collate the responses into a combined structure - a list or pandas dataframe etc.
In my example code I'm connecting to 2 urls and printing a chunk of the response. How can I collate the responses and access them all?
import asyncio, aiohttp
async def get_url(session, url, timeout=300):
async with session.get(url, timeout=timeout) as response:
http = await response.text()
print(str(http[:80])+'\n')
return http # becomes a list item when gathered
async def async_payload_wrapper(async_loop):
# test with 2 urls as PoC
urls = ['https://google.com','https://yahoo.com']
async with aiohttp.ClientSession(loop=async_loop) as session:
urls_to_check = [get_url(session, url) for url in urls]
await asyncio.gather(*urls_to_check)
if __name__ == '__main__':
event_loop = asyncio.get_event_loop()
event_loop.run_until_complete(async_payload_wrapper(event_loop))
I've tried printing to a file, and that works but it's slow and I need to read it again for further processing. I've tried appending to a global variable without success. E.g. using a variable inside get_url that is defined outside it generates an error, eg:
NameError: name 'my_list' is not defined or
UnboundLocalError: local variable 'my_list' referenced before assignment
Thanks #python_user that's exactly what I was missing and the returned type is indeed a simple list. I think I'd tried to pick up the responses inside the await part which doesn't work.
My updated PoC code below.
Adapting this for the API, JSON and pandas should now be easy : )
import asyncio, aiohttp
async def get_url(session, url, timeout=300):
async with session.get(url, timeout=timeout) as response:
http = await response.text()
return http[:80] # becomes a list element
async def async_payload_wrapper(async_loop):
# test with 2 urls as PoC
urls = ['https://google.com','https://yahoo.com']
async with aiohttp.ClientSession(loop=async_loop) as session:
urls_to_check = [get_url(session, url) for url in urls]
responses = await asyncio.gather(*urls_to_check)
print(type(responses))
print(responses)
if __name__ == '__main__':
event_loop = asyncio.get_event_loop()
event_loop.run_until_complete(async_payload_wrapper(event_loop))
I'm extracting information using an API that has rate limitations. I'm doing this in an asynchronous way to speed up the process using asyncio and aiohttp. I'm gathering the calls in bunch of 10s, so I make 10 concurrent calls every time. If I receive a 429 I wait for 2 minutes and retry again... For the retry part, I'm using the backoff decorator.
My problem is that the retry is executed for the 10 calls and not only for the call failing... I'm not sure how to do that:
#backoff.on_exception(backoff.expo,aiohttp.ClientError,max_tries=20,logger=my_logger)
async def get_symbols(size,position):
async with aiohttp.ClientSession() as session:
queries = get_tasks(session,size,position)
responses = await asyncio.gather(*queries)
print("gathering responses")
for response in responses:
if response.status == 429:
print(response.headers)
print("Code 429 received waiting for 2 minutes")
print(response)
time.sleep(120)
raise aiohttp.ClientError()
else:
query_data = await response.read()
Does anyone have a way of just execute the failing call and not the whole bunch?
There are two issues in your code. First is duplicate sleep -- you probably don't understand how backoff works. Its whole point is to 1) try, 2) sleep exponentially increasing delay if there was an error, 3) retry your function/coroutine for you. Second, is that get_symbols is decorated with backoff, hence obviously it's retried as a whole.
How to improve?
Decorate individual request function
Let backoff do its "sleeping" job
Let aiohttp do its job by letting it raise for non-200 HTTP repropose codes by setting raise_for_status=True in ClientSession initialiser
It should look something like the following.
#backoff.on_exception(backoff.expo, aiohttp.ClientError, max_tries=20)
async def send_task(client, params):
async with client.get('https://python.org/', params=params) as resp:
return await resp.text()
def get_tasks(client, size, position):
for params in get_param_list(size, position)
yield send_task(client, params)
async def get_symbols(size,position):
async with aiohttp.ClientSession(raise_for_status=True) as client:
tasks = get_tasks(session, size, position)
responses = await asyncio.gather(*tasks)
for response in responses:
print(await response.read())
I am attempting to return the HTTP request code from a list of urls asynchronously using this code I found online, however after a few printed I receive the error ClientConnectorError: Cannot connect to host website_i_removed :80 ssl:None [getaddrinfo failed]. As the website is valid I am confused on how it says I cannot connect. If I am doing this wrong at any point please point me in the right direction.
The past few hours I have been looking into the documentation and online for aiohttp, but they dont have an example on HTTP requests with a list of urls, and their getting started page in the docs is quite hard to follow since I am brand new to async programming. Below is the code I am using, assume urls is a list of strings.
import asyncio
from aiohttp import ClientSession
async def fetch(url, session):
async with session.get(url) as response:
code_status = response.history[0].status if response.history else response.status
print('%s -> Status Code: %s' % (url, code_status))
return await response.read()
async def bound_fetch(semaphore, url, session):
# Getter function with semaphore.
async with semaphore:
await fetch(url, session)
async def run(urls):
tasks = []
# create instance of Semaphore
semaphore = asyncio.Semaphore(1000)
async with ClientSession() as session:
for url in urls:
# pass Semaphore and session to every GET request
task = asyncio.ensure_future(bound_fetch(semaphore, url, session))
tasks.append(task)
responses = asyncio.gather(*tasks)
await responses
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(urls))
loop.run_until_complete(future)
I expected each website to print its request code to determine if they are reachable, however it says I can't connect to some despite me being able to look them up in my browser.
When I run this it lists off the websites in the database one by one with the response code and it takes about 10 seconds to run through a very small list. It should be way faster and isn't running asynchronously but I'm not sure why.
import dblogin
import aiohttp
import asyncio
import async_timeout
dbconn = dblogin.connect()
dbcursor = dbconn.cursor(buffered=True)
dbcursor.execute("SELECT thistable FROM adatabase")
website_list = dbcursor.fetchall()
async def fetch(session, url):
with async_timeout.timeout(30):
async with session.get(url, ssl=False) as response:
await response.read()
return response.status, url
async def main():
async with aiohttp.ClientSession() as session:
for all_urls in website_list:
url = all_urls[0]
resp = await fetch(session, url)
print(resp, url)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
dbcursor.close()
dbconn.close()
This article explains the details. What you need to do is pass each fetch call in a Future object, and then pass a list of those to either asyncio.wait or asyncio.gather depending on your needs.
Your code would look something like this:
async def fetch(session, url):
with async_timeout.timeout(30):
async with session.get(url, ssl=False) as response:
await response.read()
return response.status, url
async def main():
tasks = []
async with aiohttp.ClientSession() as session:
for all_urls in website_list:
url = all_urls[0]
task = asyncio.create_task(fetch(session, url))
tasks.append(task)
responses = await asyncio.gather(*tasks)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
future = asyncio.create_task(main())
loop.run_until_complete(future)
Also, are you sure that loop.close() call is needed? The docs mention that
The loop must not be running when this function is called. Any pending callbacks will be discarded.
This method clears all queues and shuts down the executor, but does not wait for the executor to finish.
As mentioned in the docs and in the link that #user4815162342 posted, it is better to use the create_task method instead of the ensure_future method when we know that the argument is a coroutine. Note that this was added in Python 3.7, so previous versions should continue using ensure_future instead.
First of all heres the code:
import random
import asyncio
from aiohttp import ClientSession
import csv
headers =[]
def extractsites(file):
sites = []
readfile = open(file, "r")
reader = csv.reader(readfile, delimiter=",")
raw = list(reader)
for a in raw:
sites.append((a[1]))
return sites
async def fetchheaders(url, session):
async with session.get(url) as response:
responseheader = await response.headers
print(responseheader)
return responseheader
async def bound_fetch(sem, url, session):
async with sem:
print("doing request for "+ url)
await fetchheaders(url, session)
async def run():
urls = extractsites("cisco-umbrella.csv")
tasks = []
# create instance of Semaphore
sem = asyncio.Semaphore(100)
async with ClientSession() as session:
for i in urls:
task = asyncio.ensure_future(bound_fetch(sem, "http://"+i, session))
tasks.append(task)
return tasks
def main():
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
if __name__ == '__main__':
main()
Most of this code was taken from this blog post:
https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
Here is my problem that I'm facing: I am trying to read a million urls from a file and then make async request for each of them.
But when I try to execute the code above I get the Session expired error.
This is my line of thought:
I am relatively new to async programming so bear with me.
My though process was to create a long task list (that only allows 100 parallel requests), that I build in the run function, and then pass as a future to the event loop to execute.
I have included a print debug in the bound_fetch (which I copied from the blog post) and it looks like it loops over all urls that I have and as soon as it should start making requests in the fetchheaders function I get the runtime errors.
How do I fix my code ?
A couple things here.
First, in your run function you actually want to gather the tasks there and await them to fix your session issue, like so:
async def run():
urls = ['google.com','amazon.com']
tasks = []
# create instance of Semaphore
sem = asyncio.Semaphore(100)
async with ClientSession() as session:
for i in urls:
task = asyncio.ensure_future(bound_fetch(sem, "http://"+i, session))
tasks.append(task)
await asyncio.gather(*tasks)
Second, the aiohttp API is a little odd in dealing with headers in that you can't await them. I worked around this by awaiting body so that headers are populated and then returning the headers:
async def fetchheaders(url, session):
async with session.get(url) as response:
data = await response.read()
responseheader = response.headers
print(responseheader)
return responseheader
There is some additional overhead here in pulling the body however. I couldn't find another way to load headers though without doing a body read.