I'm working with asyncio and aiohttp to call an API many times. While I can print the responses, I want to collate them into a combined structure, such as a list or a pandas DataFrame.
In my example code I'm connecting to 2 urls and printing a chunk of the response. How can I collate the responses and access them all?
import asyncio, aiohttp

async def get_url(session, url, timeout=300):
    async with session.get(url, timeout=timeout) as response:
        http = await response.text()
        print(str(http[:80])+'\n')
        return http # becomes a list item when gathered

async def async_payload_wrapper(async_loop):
    # test with 2 urls as PoC
    urls = ['https://google.com','https://yahoo.com']
    async with aiohttp.ClientSession(loop=async_loop) as session:
        urls_to_check = [get_url(session, url) for url in urls]
        await asyncio.gather(*urls_to_check)

if __name__ == '__main__':
    event_loop = asyncio.get_event_loop()
    event_loop.run_until_complete(async_payload_wrapper(event_loop))
I've tried printing to a file, and that works, but it's slow and I need to read the file again for further processing. I've also tried appending to a global variable, without success. For example, using a variable inside get_url that is defined outside it generates an error such as:
NameError: name 'my_list' is not defined
or
UnboundLocalError: local variable 'my_list' referenced before assignment
Thanks @python_user, that's exactly what I was missing, and the returned type is indeed a simple list. I think I'd tried to pick up the responses inside the await part, which doesn't work.
My updated PoC code below.
Adapting this for the API, JSON and pandas should now be easy : )
import asyncio, aiohttp

async def get_url(session, url, timeout=300):
    async with session.get(url, timeout=timeout) as response:
        http = await response.text()
        return http[:80] # becomes a list element

async def async_payload_wrapper(async_loop):
    # test with 2 urls as PoC
    urls = ['https://google.com','https://yahoo.com']
    async with aiohttp.ClientSession(loop=async_loop) as session:
        urls_to_check = [get_url(session, url) for url in urls]
        responses = await asyncio.gather(*urls_to_check)
        print(type(responses))
        print(responses)

if __name__ == '__main__':
    event_loop = asyncio.get_event_loop()
    event_loop.run_until_complete(async_payload_wrapper(event_loop))
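As a rough sketch of the collation step, the gathered list can be zipped with the URL list to build one row per request. To keep it runnable without network access, fake_get_url below is a hypothetical stand-in for get_url, and the row layout is an assumption rather than anything required by asyncio:

```python
import asyncio

async def fake_get_url(url):
    # Hypothetical stand-in for get_url(): no network, just echoes the URL.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

async def collect(urls):
    # gather() returns a list whose order matches the order of `urls`.
    responses = await asyncio.gather(*(fake_get_url(u) for u in urls))
    # Pair each URL with its response; a list of dicts like this could be
    # passed to pandas.DataFrame(...) if pandas is installed.
    return [{"url": u, "body": r} for u, r in zip(urls, responses)]

urls = ["https://google.com", "https://yahoo.com"]
rows = asyncio.run(collect(urls))
print(rows)
```

Because the results come back in input order, zip is enough to match each response to its URL.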
I am doing some requests to Azure Maps. I have a subscription key (subscriptionKey) and a list of addresses I want to look for (addresses):
query_template = 'https://atlas.microsoft.com/search/address/json?&subscription-key={}&api-version=1.0&language=en-US&query={}'
queries = [query_template.format(subscriptionKey, address) for address in addresses]
I come from this question (not necessary to read it to understand the following) and everything worked fine in my sample of 1k queries. However, when I tried 10k queries I got ValueError: too many file descriptors in select(). I added some of the answers from here and now my code looks like this:
import asyncio
from aiohttp import ClientSession
from ssl import SSLContext
from sys import platform
import nest_asyncio
nest_asyncio.apply()

# Function to get a JSON from the result of a query
async def fetch(url, session):
    async with session.get(url, ssl=SSLContext()) as response:
        return await response.json()

# Function to run 'fetch()' with a Semaphore and check that the result is a dictionary (JSON)
async def fetch_sem(sem, attempts, url, session):
    semaphore = asyncio.Semaphore(sem)
    async with semaphore:
        for _ in range(attempts):
            result = await fetch(url, session)
            if isinstance(result, dict):
                break
        return result

# Function to search for all queries
async def fetch_all(sem, attempts, urls):
    async with ClientSession() as session:
        return await asyncio.gather(*[fetch_sem(sem, attempts, url, session) for url in urls], return_exceptions=True)

# Making the queries
if __name__ == '__main__':
    if platform == 'win32':
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(fetch_all(1000, 3, queries))
Note that I have included both asyncio.Semaphore and asyncio.ProactorEventLoop(). Despite these additions, I still get ValueError: too many file descriptors in select().
Could I get some help with this issue? Thank you!
The purpose of the semaphore is to count how many fetch operations are currently running and enforce an upper limit. That's why you need a single semaphore shared by all the calls. Your fetch_sem creates a new one on every call, so nothing is ever limited.
You could create it in fetch_all and pass it to fetch_sem:
async def fetch_sem(semaphore, attempts, url, session):
    async with semaphore:
        ...
        return result

async def fetch_all(limit, attempts, urls):
    semaphore = asyncio.Semaphore(limit)
    async with ClientSession() as session:
        return await asyncio.gather(*[fetch_sem(semaphore, attempts, url, session) for url in urls], return_exceptions=True)

....
results = loop.run_until_complete(fetch_all(1000, 3, queries))
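To see the effect of sharing one semaphore, here is a minimal sketch that tracks how many jobs run at the same time. The names limited_job, run_jobs and the state dictionary are made up for illustration, and asyncio.sleep stands in for the HTTP request:

```python
import asyncio

async def limited_job(semaphore, state):
    # Every job waits on the SAME semaphore, so at most `limit` run at once.
    async with semaphore:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        state["running"] -= 1

async def run_jobs(n_jobs, limit):
    semaphore = asyncio.Semaphore(limit)  # created once, shared by all jobs
    state = {"running": 0, "peak": 0}
    await asyncio.gather(*(limited_job(semaphore, state) for _ in range(n_jobs)))
    return state["peak"]

peak = asyncio.run(run_jobs(n_jobs=20, limit=3))
print(peak)
```

If the semaphore were created inside limited_job instead, each job would get its own counter and all 20 would run at once.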
I am attempting to return the HTTP status code for each URL in a list asynchronously, using code I found online. However, after a few are printed I receive the error ClientConnectorError: Cannot connect to host website_i_removed :80 ssl:None [getaddrinfo failed]. As the website is valid, I am confused about why it says I cannot connect. If I am doing something wrong at any point, please point me in the right direction.
For the past few hours I have been looking through the aiohttp documentation and online, but there is no example of HTTP requests with a list of URLs, and the getting started page in the docs is quite hard to follow since I am brand new to async programming. Below is the code I am using; assume urls is a list of strings.
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        code_status = response.history[0].status if response.history else response.status
        print('%s -> Status Code: %s' % (url, code_status))
        return await response.read()

async def bound_fetch(semaphore, url, session):
    # Getter function with semaphore.
    async with semaphore:
        await fetch(url, session)

async def run(urls):
    tasks = []
    # create instance of Semaphore
    semaphore = asyncio.Semaphore(1000)
    async with ClientSession() as session:
        for url in urls:
            # pass Semaphore and session to every GET request
            task = asyncio.ensure_future(bound_fetch(semaphore, url, session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(urls))
loop.run_until_complete(future)
I expected each website to print its request code to determine if they are reachable, however it says I can't connect to some despite me being able to look them up in my browser.
I'm trying to get data from a website using async in python. As an example I used this code (under A Better Coroutine Example): https://www.blog.pythonlibrary.org/2016/07/26/python-3-an-intro-to-asyncio/
Now this works fine, but it writes the binary chunks to a file and I don't want it in a file. I want the resulting data directly. But I currently have a list of coroutine objects which I can not get the data out of.
The code:
# -*- coding: utf-8 -*-
import aiohttp
import asyncio
import async_timeout

async def fetch(session, url):
    with async_timeout.timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def main(loop, urls):
    async with aiohttp.ClientSession(loop=loop) as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)
        return tasks

# time normal way of retrieval
if __name__ == '__main__':
    urls = [...]  # a list of urls
    loop = asyncio.get_event_loop()
    details_async = loop.run_until_complete(main(loop, urls))
Thanks
The problem is in return tasks at the end of main(), which is not present in the original article. Instead of returning the coroutine objects (which are not useful once passed to asyncio.gather), you should be returning the list produced by awaiting asyncio.gather, which contains the results of running the coroutines in the same order as they were passed in. For example:
async def main(loop, urls):
    async with aiohttp.ClientSession(loop=loop) as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results
Now loop.run_until_complete(main(loop, urls)) will return a list of texts in the same order as the URLs.
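The ordering guarantee can be checked with a small sketch that needs no network. fake_fetch is a hypothetical stand-in for fetch, with a delay argument so that the first coroutine finishes last:

```python
import asyncio

async def fake_fetch(name, delay):
    # Stand-in for fetch(): waits `delay` seconds, then returns its name.
    await asyncio.sleep(delay)
    return name

async def main():
    # "slow" completes after "fast", yet keeps its position in the results.
    return await asyncio.gather(fake_fetch("slow", 0.05), fake_fetch("fast", 0.0))

results = asyncio.run(main())
print(type(results), results)
```

The results follow the order of the arguments to gather, not the order in which the coroutines finish.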
I don't fully understand how asyncio and aiohttp work yet.
I am trying to make a bunch of asynchronous API requests from a list of URLs and save the responses in a variable so I can process them later.
So far I am generating the list, which is no problem, and setting up the request framework.
urls = []
for i in range(0,20):
    urls.append('https://api.binance.com/api/v1/klines?symbol={}&interval={}&limit={}'.format(pairs_list_pairs[i], time_period, pull_limit))

import asyncio
import aiohttp

async def request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    results = await asyncio.gather(
        request(urls[0]),
        request(urls[1]),
    )
    print(len(results))
    print(results)

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
    loop.run_until_complete(loop.shutdown_asyncgens())
finally:
    loop.close()
If I manually type out my requests one by one using indexing (like below), I can make the requests. The problem is that my list has upwards of 100 API requests that I don't want to type by hand. How can I iterate through my list? Also, how can I save my results into a variable? When the script ends it does not save "results" anywhere.
async def main():
    results = await asyncio.gather(
        request(urls[0]),
        request(urls[1]),
    )
    print(len(results))
    print(results)
Below are some sample urls to replicate the code:
[
'https://api.binance.com/api/v1/klines?symbol=ETHBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=LTCBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=BNBBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=NEOBTC&interval=15m&limit=1',
]
To pass a variable number of arguments to gather, use the * function argument syntax:
results = await asyncio.gather(*[request(u) for u in urls])
Note that f(*args) is a standard Python feature to invoke f with positional arguments calculated at run-time.
results will be available once all requests are done, and they will be in a list in the same order as the URLs. Then you can return them from main, which will cause them to be returned by run_until_complete.
Also, you will have much better performance if you create the session only once, and reuse it for all requests, e.g. by passing it as a second argument to the request function.
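Putting the two suggestions together might look roughly like the sketch below. FakeSession and its get_text method are hypothetical stand-ins for aiohttp.ClientSession so that the example runs offline, and the example.com URLs are placeholders:

```python
import asyncio

class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession; counts instances."""
    created = 0

    def __init__(self):
        FakeSession.created += 1

    async def get_text(self, url):
        await asyncio.sleep(0)  # stand-in for network I/O
        return f"body of {url}"

async def request(url, session):
    # The session arrives as a second argument instead of being created here.
    return await session.get_text(url)

async def main(urls):
    session = FakeSession()  # one session for all requests
    # The * unpacks the list into positional arguments for gather().
    return await asyncio.gather(*[request(u, session) for u in urls])

urls = [f"https://example.com/{i}" for i in range(5)]
results = asyncio.run(main(urls))
print(len(results), FakeSession.created)
```

With the real library, the session would be opened with async with aiohttp.ClientSession() as session inside main and passed down in the same way.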
Using gather and a helper function (request) only makes a quite simple task more complicated and difficult to work with. You can simply use the same ClientSession for all your individual requests in a loop, saving each response into a result list. Note that this loop sends the requests one after another rather than concurrently.
async def main():
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                results.append(await resp.text())
    print(len(results))
    print(results)
For the other part of your question, when you said:
When the script ends it does not save "results" anywhere.
if you meant that you want to access results outside of the main coroutine, you simply can add a return statement.
At the end of main, add:
return results
and change
loop.run_until_complete(main())
# into:
results = loop.run_until_complete(main())
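One trade-off worth keeping in mind: a plain for loop awaits each request before starting the next, whereas gather lets them overlap. The sketch below compares the two with a simulated 50 ms request. fake_request and timed are hypothetical helpers, and the exact timings will vary by machine:

```python
import asyncio
import time

async def fake_request(url):
    await asyncio.sleep(0.05)  # stand-in for one HTTP round trip
    return url.upper()

async def sequential(urls):
    results = []
    for url in urls:
        results.append(await fake_request(url))  # one at a time
    return results

async def concurrent(urls):
    return await asyncio.gather(*(fake_request(u) for u in urls))

def timed(coro):
    # Run a coroutine to completion and report the elapsed wall time.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

urls = ["a", "b", "c", "d"]
seq_results, seq_time = timed(sequential(urls))
con_results, con_time = timed(concurrent(urls))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both versions return the same results in the same order, so the choice is mainly about speed versus simplicity.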
First of all, here's the code:
import random
import asyncio
from aiohttp import ClientSession
import csv

headers =[]

def extractsites(file):
    sites = []
    readfile = open(file, "r")
    reader = csv.reader(readfile, delimiter=",")
    raw = list(reader)
    for a in raw:
        sites.append((a[1]))
    return sites

async def fetchheaders(url, session):
    async with session.get(url) as response:
        responseheader = await response.headers
        print(responseheader)
        return responseheader

async def bound_fetch(sem, url, session):
    async with sem:
        print("doing request for "+ url)
        await fetchheaders(url, session)

async def run():
    urls = extractsites("cisco-umbrella.csv")
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(100)
    async with ClientSession() as session:
        for i in urls:
            task = asyncio.ensure_future(bound_fetch(sem, "http://"+i, session))
            tasks.append(task)
        return tasks

def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(run())
    loop.run_until_complete(future)

if __name__ == '__main__':
    main()
Most of this code was taken from this blog post:
https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
Here is my problem that I'm facing: I am trying to read a million urls from a file and then make async request for each of them.
But when I try to execute the code above I get the Session expired error.
This is my line of thought:
I am relatively new to async programming so bear with me.
My thought process was to create a long task list (that only allows 100 parallel requests), which I build in the run function and then pass as a future to the event loop to execute.
I have included a debug print in bound_fetch (which I copied from the blog post). It looks like it loops over all the URLs I have, and as soon as it should start making requests in the fetchheaders function I get the runtime errors.
How do I fix my code?
A couple things here.
First, in your run function you actually want to gather the tasks there and await them to fix your session issue, like so:
async def run():
    urls = ['google.com','amazon.com']
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(100)
    async with ClientSession() as session:
        for i in urls:
            task = asyncio.ensure_future(bound_fetch(sem, "http://"+i, session))
            tasks.append(task)
        await asyncio.gather(*tasks)
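To illustrate why the gather has to be awaited inside the async with block, here is a small offline sketch. FakeSession is a hypothetical stand-in that only models "closes on exit" and is not the aiohttp implementation. Unlike your script, the broken variant here still awaits the tasks after the block so that the failure is visible in the results:

```python
import asyncio

class FakeSession:
    """Hypothetical stand-in for a client session that closes on exit."""

    def __init__(self):
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc_info):
        self.closed = True

    async def get(self, url):
        await asyncio.sleep(0.01)  # the request is "in flight" here
        if self.closed:
            raise RuntimeError("Session is closed")
        return f"headers for {url}"

async def run_leaving_block_early(urls):
    async with FakeSession() as session:
        tasks = [asyncio.ensure_future(session.get(u)) for u in urls]
    # The session is already closed; the tasks have not finished yet.
    return await asyncio.gather(*tasks, return_exceptions=True)

async def run_awaiting_inside(urls):
    async with FakeSession() as session:
        tasks = [asyncio.ensure_future(session.get(u)) for u in urls]
        # Awaiting here keeps the session open until every task is done.
        return await asyncio.gather(*tasks, return_exceptions=True)

urls = ["http://google.com", "http://amazon.com"]
broken = asyncio.run(run_leaving_block_early(urls))
fixed = asyncio.run(run_awaiting_inside(urls))
print(broken)
print(fixed)
```

In the first variant every task fails because the session was closed before the tasks ran. In the second, all of them complete normally.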
Second, the aiohttp API is a little odd in dealing with headers in that you can't await them. I worked around this by awaiting body so that headers are populated and then returning the headers:
async def fetchheaders(url, session):
    async with session.get(url) as response:
        data = await response.read()
        responseheader = response.headers
        print(responseheader)
        return responseheader
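There is some additional overhead here in pulling the body, however. I couldn't find another way to load the headers without doing a body read, although response.headers is an ordinary attribute rather than a coroutine, so it may already be populated once session.get() returns.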