I'm using Python 3.7 and trying to build a crawler that can fetch from multiple domains asynchronously. I'm using asyncio and aiohttp for this, but I'm running into problems with aiohttp.ClientSession. This is my reduced code:
import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        print(await response.text())

async def main():
    loop = asyncio.get_event_loop()
    async with aiohttp.ClientSession(loop=loop) as session:
        cwlist = [loop.create_task(fetch(session, url)) for url in ['http://python.org', 'http://google.com']]
        asyncio.gather(*cwlist)

if __name__ == "__main__":
    asyncio.run(main())
The thrown exception is this:
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=RuntimeError('Session is closed')>
What am I doing wrong here?
You forgot to await the asyncio.gather result:
    async with aiohttp.ClientSession(loop=loop) as session:
        cwlist = [loop.create_task(fetch(session, url)) for url in ['http://python.org', 'http://google.com']]
        await asyncio.gather(*cwlist)
If you ever have an async with containing no await expressions you should be fairly suspicious.
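To see why the missing await matters without touching the network, here is a minimal sketch. FakeSession is a hypothetical stand-in for aiohttp.ClientSession (not part of aiohttp) that refuses to work once its block has exited. The first coroutine leaves the block before the requests have run; the second waits inside it.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession: usable only inside its block."""

    def __init__(self):
        self.closed = False

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.closed = True

    async def get(self, url):
        await asyncio.sleep(0.01)  # simulate network latency
        if self.closed:
            raise RuntimeError("Session is closed")
        return f"body of {url}"


URLS = ["http://python.org", "http://google.com"]


async def without_await():
    async with FakeSession() as session:
        pending = asyncio.gather(*(session.get(u) for u in URLS))
    # The block has exited, so the session closed before any request finished.
    try:
        return await pending
    except RuntimeError as exc:
        return str(exc)


async def with_await():
    async with FakeSession() as session:
        # Awaiting inside the block keeps the session open until all requests are done.
        return await asyncio.gather(*(session.get(u) for u in URLS))


print(asyncio.run(without_await()))
print(asyncio.run(with_await()))
```

The only difference between the two coroutines is where the await happens relative to the end of the async with block.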
Much is written about creating self-contained functions for other tasks, such as
"How to call a async function contained in a class?" and
"How to set class attribute with await in __init__",
but none of it covers how to do so for GET requests.
Considering the following MWE, how might this be transformed into a self-contained class?
import aiohttp
import asyncio
import time

async def get_pokemon(session, url):
    async with session.get(url) as resp:
        pokemon = await resp.json()
        return pokemon['name']

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for number in range(1, 151):
            url = f'https://pokeapi.co/api/v2/pokemon/{number}'
            tasks.append(asyncio.ensure_future(get_pokemon(session, url)))
        original_pokemon = await asyncio.gather(*tasks)
        for pokemon in original_pokemon:
            print(pokemon)

asyncio.run(main())
Code credit: https://www.twilio.com/blog/asynchronous-http-requests-in-python-with-aiohttp
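One possible shape for such a class, offered as a sketch rather than a definitive answer: the class owns its session and acts as an async context manager, so the session is created and closed in one place. The names PokemonClient, get_name, get_names and the session_factory parameter are all hypothetical. session_factory exists only so the class can be exercised with a fake session; when it is omitted, aiohttp.ClientSession is imported lazily and used.

```python
import asyncio


class PokemonClient:
    """Hypothetical wrapper that owns its HTTP session."""

    BASE_URL = "https://pokeapi.co/api/v2/pokemon"

    def __init__(self, session_factory=None):
        # session_factory is an assumption added for testability.
        self._session_factory = session_factory
        self._session = None

    async def __aenter__(self):
        factory = self._session_factory
        if factory is None:
            import aiohttp  # third-party; only needed for real requests
            factory = aiohttp.ClientSession
        self._session = factory()  # created while the event loop is running
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self._session.close()
        self._session = None

    async def get_name(self, number):
        url = f"{self.BASE_URL}/{number}"
        async with self._session.get(url) as resp:
            pokemon = await resp.json()
            return pokemon["name"]

    async def get_names(self, numbers):
        # gather returns results in the order the coroutines were passed in.
        return await asyncio.gather(*(self.get_name(n) for n in numbers))


async def main():
    async with PokemonClient() as client:
        for name in await client.get_names(range(1, 151)):
            print(name)
```

With aiohttp installed, asyncio.run(main()) would print the same names as the MWE above. Keeping the session inside __aenter__/__aexit__ avoids creating it in __init__, where no event loop is running yet.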
I want to run many HTTP requests in parallel using Python.
I tried the aiohttp module together with asyncio.
import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        for i in range(10):
            async with session.get('https://httpbin.org/get') as response:
                html = await response.text()
                print('done' + str(i))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
I expect it to execute all the requests in parallel, but they are executed one by one.
Although I later solved this using threading, I would like to know what's wrong with this code.
You need to issue the requests concurrently. Currently there is a single task, defined by main(), so the HTTP requests run serially within that task.
If you are on Python 3.7+, you could also use asyncio.run(), which takes care of creating the event loop:
import aiohttp
import asyncio

async def getResponse(session, i):
    async with session.get('https://httpbin.org/get') as response:
        html = await response.text()
        print('done' + str(i))

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [getResponse(session, i) for i in range(10)]  # create a list of coroutines
        await asyncio.gather(*tasks)  # gather wraps them in tasks and runs them concurrently

asyncio.run(main())
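To make the difference measurable without hitting a real server, here is a small sketch that replaces the HTTP call with asyncio.sleep. fake_request and DELAY are hypothetical stand-ins; the two coroutines mirror the serial loop from the question and the gather version from the answer.

```python
import asyncio
import time

DELAY = 0.1  # assumed per-request latency, in seconds


async def fake_request(i):
    # Stand-in for session.get(...); sleeps instead of touching the network.
    await asyncio.sleep(DELAY)
    return i


async def sequential(n):
    # Each await finishes before the next request starts.
    return [await fake_request(i) for i in range(n)]


async def concurrent(n):
    # All requests are started together, then awaited as a group.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


seq_result, seq_time = timed(sequential(5))
con_result, con_time = timed(concurrent(5))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential version takes roughly n times DELAY, while the concurrent one takes roughly one DELAY, because the waits overlap.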
I'm using Python asyncio to implement a fast http client.
As you can see from the comment inside the worker function below, I get the responses as soon as they finish. I would like to get the responses in order, which is why I'm using asyncio.gather.
Why is it returning None? Can anybody help?
Thank you so much!
import time
import aiohttp
import asyncio

MAXREQ = 100
MAXTHREAD = 500
URL = 'https://google.com'
g_thread_limit = asyncio.Semaphore(MAXTHREAD)

async def worker(session):
    async with session.get(URL) as response:
        await response.read()  # If I print this line I get the responses correctly

async def run(worker, *argv):
    async with g_thread_limit:
        await worker(*argv)

async def main():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[run(worker, session) for _ in range(MAXREQ)])

if __name__ == '__main__':
    totaltime = time.time()
    print(asyncio.get_event_loop().run_until_complete(main()))  # I'm getting a None here
    print(time.time() - totaltime)
Your function run doesn't return anything explicitly, so it returns None implicitly. Add return statements to both worker and run and you'll get the results:
async def worker(session):
    async with session.get(URL) as response:
        return await response.read()

async def run(worker, *argv):
    async with g_thread_limit:
        return await worker(*argv)
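A network-free sketch of the same pattern, to show that gather hands back results in submission order even when the workers finish in a different order, and that the semaphore caps how many run at once. LIMIT, the state dictionary and the sleep durations are assumptions made for the illustration.

```python
import asyncio

LIMIT = 3  # assumed concurrency cap, playing the role of MAXTHREAD


async def worker(i, state):
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    # Later items sleep less, so completion order differs from submission order.
    await asyncio.sleep(0.01 * (10 - i))
    state["active"] -= 1
    return i  # without this return, gather would collect None


async def run(limit, i, state):
    async with limit:
        return await worker(i, state)


async def main():
    limit = asyncio.Semaphore(LIMIT)  # created inside the running loop
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(run(limit, i, state) for i in range(10)))
    return results, state["peak"]


results, peak = asyncio.run(main())
print(results, peak)
```

Note that main() here also returns what gather produced. In the question's code, main() awaits gather but does not return its result, so printing the value of run_until_complete(main()) would still show None unless main() returns it as well.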
When I run this, it lists the websites in the database one by one with their response codes, and it takes about 10 seconds to get through a very small list. It should be much faster; it doesn't seem to be running asynchronously, but I'm not sure why.
import dblogin
import aiohttp
import asyncio
import async_timeout

dbconn = dblogin.connect()
dbcursor = dbconn.cursor(buffered=True)
dbcursor.execute("SELECT thistable FROM adatabase")
website_list = dbcursor.fetchall()

async def fetch(session, url):
    with async_timeout.timeout(30):
        async with session.get(url, ssl=False) as response:
            await response.read()
            return response.status, url

async def main():
    async with aiohttp.ClientSession() as session:
        for all_urls in website_list:
            url = all_urls[0]
            resp = await fetch(session, url)
            print(resp, url)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()
    dbcursor.close()
    dbconn.close()
This article explains the details. What you need to do is wrap each fetch call in a Task (a Future subclass), and then pass a list of those to either asyncio.wait or asyncio.gather, depending on your needs.
Your code would look something like this:
async def fetch(session, url):
    # Newer async_timeout releases expect "async with" here
    async with async_timeout.timeout(30):
        async with session.get(url, ssl=False) as response:
            await response.read()
            return response.status, url

async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for all_urls in website_list:
            url = all_urls[0]
            # create_task schedules the coroutine right away
            task = asyncio.create_task(fetch(session, url))
            tasks.append(task)
        # Wait inside the block so the session stays open until every request finishes
        responses = await asyncio.gather(*tasks)
    for resp in responses:
        print(resp)

if __name__ == '__main__':
    # asyncio.create_task needs a running loop, so start main() with asyncio.run
    asyncio.run(main())
Also, are you sure that loop.close() call is needed? The docs mention that
The loop must not be running when this function is called. Any pending callbacks will be discarded.
This method clears all queues and shuts down the executor, but does not wait for the executor to finish.
As mentioned in the docs and in the link that @user4815162342 posted, it is better to use asyncio.create_task instead of asyncio.ensure_future when we know that the argument is a coroutine. Note that create_task was added in Python 3.7, so earlier versions should continue using ensure_future.
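One practical difference worth keeping in mind: asyncio.create_task schedules work on the currently running loop, so it only works when called from inside a coroutine (or a callback) that the loop is executing. The sketch below illustrates this with a trivial job coroutine; job and schedule_outside_loop are hypothetical names.

```python
import asyncio


async def job():
    await asyncio.sleep(0)
    return "done"


def schedule_outside_loop():
    coro = job()
    try:
        asyncio.create_task(coro)  # no loop is running here
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning
        return str(exc)
    return None


async def main():
    task = asyncio.create_task(job())  # fine: called from inside a running loop
    return await task


print(schedule_outside_loop())
print(asyncio.run(main()))
```

This is why the entry point should be started with asyncio.run(main()) or loop.run_until_complete(main()), with create_task reserved for code that runs inside main().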
I need to fetch and parse the content of one link repeatedly. The synchronous approach gives me 2-3 responses per second, and I need it to be faster (yes, I know that too fast is bad too).
I found some async examples, but all of them show how to handle the results after all links have been fetched, whereas I need to parse each response immediately after receiving it. I tried something like this, but this code doesn't give any speed improvement:
import aiohttp
import asyncio
import time

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    while True:
        async with aiohttp.ClientSession() as session:
            html = await fetch(session, 'https://example.com')
            print(time.time())
            #do_something_with_html(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
but this code doesn't give any speed improvement
asyncio (and async/concurrency in general) gives a speed improvement for I/O operations that can overlap with each other.
When everything you do is await something and you never create any parallel tasks (using asyncio.create_task(), asyncio.ensure_future() etc.) then you are basically doing the classic synchronous programming :)
So, how to make the requests faster:
import aiohttp
import asyncio
import time

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def check_link(session):
    html = await fetch(session, 'https://example.com')
    print(time.time())
    #do_something_with_html(html)

async def main():
    async with aiohttp.ClientSession() as session:
        while True:
            asyncio.create_task(check_link(session))
            await asyncio.sleep(0.05)

asyncio.run(main())
Notice: the async with aiohttp.ClientSession() as session: must be above (outside) while True: for this to work. Actually, having a single ClientSession() for all your requests is a good practice anyway.
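A bounded, network-free variant of the same idea, as a sketch. The asyncio documentation recommends keeping a reference to tasks created with create_task so they are not garbage-collected mid-execution, so this version stores them in a set and drains the set before returning. fake_fetch, the handled list and the rounds parameter are hypothetical stand-ins for the real request, do_something_with_html and the endless loop.

```python
import asyncio


async def fake_fetch(i):
    # Stand-in for fetch(session, url); later requests answer sooner.
    await asyncio.sleep(0.05 - 0.01 * i)
    return f"html {i}"


async def check_link(i, handled):
    html = await fake_fetch(i)
    handled.append(html)  # placeholder for do_something_with_html(html)


async def main(rounds=4):
    handled = []
    background = set()
    for i in range(rounds):
        task = asyncio.create_task(check_link(i, handled))
        background.add(task)                        # keep a strong reference
        task.add_done_callback(background.discard)  # drop it once finished
        await asyncio.sleep(0.001)                  # pacing between launches
    await asyncio.gather(*background)               # drain before the loop closes
    return handled


print(asyncio.run(main()))
```

Each response is handled inside its own task as soon as it arrives, so a slow response does not hold up the ones that come back earlier.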
I gave up on async; threading solved my problem, thanks to this answer:
https://stackoverflow.com/a/23102874/5678457
from threading import Thread
import requests
import time

class myClassA(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.daemon = True
        self.start()

    def run(self):
        while True:
            r = requests.get('https://ex.com')
            print(r.status_code, time.time())

for i in range(5):
    myClassA()