Python: run an async function for each item in parallel, without awaiting sequentially - python

I have 10 links in my CSV which I'm trying to request all at the same time in a loop from the getTasks function. However, the way it works now, it sends a request to link 1, waits for it to complete, then link 2, and so on. I want all 10 links to run whenever startTask is called, leading to 10 requests a second.
Does anyone know how to do that with the code below? Thanks in advance.
import requests
from bs4 import BeautifulSoup
import asyncio

def getTasks(tasks):
    for task in tasks:
        asyncio.run(startTask(task))

async def startTask(task):
    success = await getProduct(task)
    if success is None:
        return startTask(task)
    success = await addToCart(task)
    if success is None:
        return startTask(task)
    ...

getTasks(tasks)

First of all, to send your requests concurrently, you should use aiohttp instead of the requests package, which blocks on I/O. And use an asyncio semaphore to limit the number of concurrent requests at any one time.
import asyncio
import aiohttp

# read links from CSV
links = [
    ...
]

# 10 is the max count of concurrent tasks
# that can be processed at the same time.
# In this case, tasks are requests.
semaphore = asyncio.BoundedSemaphore(10)

async def async_request(url):
    async with aiohttp.ClientSession() as session:
        async with semaphore, session.get(url) as response:
            return await response.text()

async def main():
    result = await asyncio.gather(*[
        async_request(link) for link in links
    ])
    print(result)  # [response1, response2, ...]

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()
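As a side note not taken from the answer above: on Python 3.7+ the explicit loop management at the end can be replaced with asyncio.run, which creates the event loop, runs the coroutine, and closes the loop for you. A minimal sketch reusing the same main() coroutine:

if __name__ == "__main__":
    # asyncio.run() sets up a fresh event loop, runs main() to completion,
    # and closes the loop afterwards.
    asyncio.run(main())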

Related

Python Requests: how to send many POST requests at the same time and wait for the responses

_1 = requests.post(logUrl, data=userDayta, headers=logHead)
I want to send many POST requests like this at the same time.
Here are two methods that work.
The reason I posted both methods in full is that the examples given on the main website throw RuntimeError: Event loop is closed, whereas both of these work.
Method 1: fewer lines of code, but longer run time (about 6.5 seconds):
import aiohttp
import asyncio
import time

start_time = time.time()

async def main():
    async with aiohttp.ClientSession() as session:
        for number in range(1, 151):
            pokemon_url = f'https://pokeapi.co/api/v2/pokemon/{number}'
            async with session.get(pokemon_url) as resp:
                pokemon = await resp.json()
                print(pokemon['name'])

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
# Wait 250 ms for the underlying SSL connections to close
loop.run_until_complete(asyncio.sleep(0.250))
loop.close()
# report the elapsed time measured from start_time at the top of the script
print("--- %s seconds ---" % (time.time() - start_time))
Method 2: more code, but shorter run time (1.5 seconds):
import aiohttp
import asyncio
import time

start_time = time.time()

async def get_pokemon(session, url):
    async with session.get(url) as resp:
        pokemon = await resp.json()
        return pokemon['name']

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for number in range(1, 151):
            url = f'https://pokeapi.co/api/v2/pokemon/{number}'
            tasks.append(asyncio.ensure_future(get_pokemon(session, url)))
        original_pokemon = await asyncio.gather(*tasks)
        for pokemon in original_pokemon:
            print(pokemon)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
# Wait 250 ms for the underlying SSL connections to close
loop.run_until_complete(asyncio.sleep(0.250))
loop.close()
# report the elapsed time measured from start_time at the top of the script
print("--- %s seconds ---" % (time.time() - start_time))
Both methods are considerably faster than the equivalent synchronous code.
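If firing all 150 requests at once is too aggressive for the API, the semaphore idea from the first answer can be combined with Method 2. A minimal sketch under that assumption; get_pokemon_limited and the cap of 10 are illustrative, not from the original answer:

import aiohttp
import asyncio

async def get_pokemon_limited(session, semaphore, url):
    # hold the semaphore for the duration of the request,
    # so at most 10 requests are in flight at any moment
    async with semaphore, session.get(url) as resp:
        pokemon = await resp.json()
        return pokemon['name']

async def main():
    semaphore = asyncio.Semaphore(10)
    async with aiohttp.ClientSession() as session:
        urls = [f'https://pokeapi.co/api/v2/pokemon/{n}' for n in range(1, 151)]
        names = await asyncio.gather(
            *(get_pokemon_limited(session, semaphore, url) for url in urls)
        )
        print(names)

asyncio.run(main())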

Asyncio not running Aiohttp requests in parallel

I want to run many HTTP requests in parallel using python.
I tried this module named aiohttp with asyncio.
import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        for i in range(10):
            async with session.get('https://httpbin.org/get') as response:
                html = await response.text()
                print('done' + str(i))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
I expect it to execute all the requests in parallel, but they are executed one by one.
I later solved this using threading, but I would like to know what's wrong with this approach.
You need to make the requests in a concurrent manner. Currently, you have a single task defined by main(), so the HTTP requests run serially within that task.
You could also consider using asyncio.run() if you are on Python 3.7+, which abstracts away creation of the event loop:
import aiohttp
import asyncio

async def getResponse(session, i):
    async with session.get('https://httpbin.org/get') as response:
        html = await response.text()
        print('done' + str(i))

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [getResponse(session, i) for i in range(10)]  # create list of tasks
        await asyncio.gather(*tasks)  # execute them concurrently

asyncio.run(main())

Asyncio aiohttp populate list of HTML from list of URL

I'm trying to get a list of HTML source code from a list of URLs asynchronously with asyncio and aiohttp but I get 2 Exceptions:
TypeError: An asyncio.Future, a coroutine or an awaitable is required
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x000001C92141F310>
RuntimeError: Event loop is closed
This is my code:
import asyncio
import aiohttp

async def main():
    tasks = []
    html_list = []
    async with aiohttp.ClientSession() as session:
        for url in ['http://www.apple.com', 'http://www.google.cl']:
            async with session.get(url) as resp:
                tasks.append(html_list.append(await resp.read()))
        print(html_list[0])
        print(html_list[1])
        await(asyncio.wait(tasks))

if __name__ == '__main__':
    asyncio.run(main())
This code gets as far as printing the HTML of ['http://www.apple.com', 'http://www.google.cl'], but immediately afterwards I get the aforementioned exceptions.
There are a few things that aren't quite right with your example:
1. You're not running anything concurrently. When you await resp.read() in your for loop, this blocks main until the result comes back. Perhaps this is what you want, but you need to use asyncio.create_task to run your requests concurrently.
2. As pointed out, you don't need the tasks list at all because of point one. You can just append to html_list.
3. You don't need to call asyncio.wait because you're not awaiting any tasks or coroutines at that point.
You can resolve your immediate issues with what you have done in the comments, but a version that actually runs concurrently looks like this:
import asyncio
import aiohttp

async def read_url(session, url):
    async with session.get(url) as resp:
        return await resp.read()

async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in ['http://www.apple.com', 'http://www.google.cl']:
            tasks.append(asyncio.create_task(read_url(session, url)))
        html_list = await asyncio.gather(*tasks)
        print(html_list[0])
        print(html_list[1])

if __name__ == '__main__':
    asyncio.run(main())
Here we define a coroutine read_url which gets the contents of a single URL. In the loop, you create a task for reading each URL and append it to the tasks list. Then you use asyncio.gather, which lets all the tasks run concurrently and waits for them to finish.
As written, I'm unable to reproduce your RuntimeError: Event loop is closed error.
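One aside that is an assumption on my part rather than something the answer states: the _ProactorBasePipeTransport name in the traceback suggests Windows, where the default proactor event loop sometimes logs RuntimeError: Event loop is closed while connections are torn down at interpreter shutdown. A commonly suggested workaround is to switch to the selector event loop policy before any asyncio code runs:

import sys
import asyncio

# WindowsSelectorEventLoopPolicy only exists on Windows, hence the guard.
if sys.platform.startswith('win'):
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())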

Where to put BeautifulSoup code in Asyncio Web Scraping Application

I need to scrape and get the raw text of the body paragraphs for many (5-10k per day) news articles. I've written some threading code, but given the highly I/O-bound nature of this project I am dabbling in asyncio. The code snippet below is no faster than a single-threaded version, and far worse than my threaded version. Could anyone tell me what I am doing wrong? Thank you!
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text()) for para in body.find_all('p')]
            results.append(paras)
    return results
await means "wait until the result is ready", so when you await the fetch in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch as a background task using something like asyncio.create_task(fetch(...)) and then await them, similar to how you'd do it with threads. Or, even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'entry-content'})
    return [normalize('NFKD', para.get_text())
            for para in body.find_all('p')]

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    paras = parse(html)
    return paras

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(session, url) for url in urls)
        )
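A hedged usage sketch, not part of the original answer, showing the imports the snippet appears to assume (normalize presumably comes from unicodedata) and how scrape_urls could be driven from synchronous code:

import asyncio
from unicodedata import normalize  # assumed source of normalize('NFKD', ...)

import aiohttp
from bs4 import BeautifulSoup

# Placeholder URLs; substitute the real article list.
urls = ['https://example.com/post-1', 'https://example.com/post-2']
paras_per_url = asyncio.run(scrape_urls(urls))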
If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:
async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate thread, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(None, parse, html)
    return paras
Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the given assignment. As this version uses asyncio for IO and threads for parsing, it should run about as fast as your threaded version, but scale to a much larger number of parallel downloads.
Finally, if you want the parsing to actually run in parallel across multiple cores, you can use multiprocessing instead:
import concurrent.futures

pool = concurrent.futures.ProcessPoolExecutor()

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate process, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(pool, parse, html)
    return paras
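One caveat worth adding, as an assumption rather than something the answer claims: on platforms where multiprocessing uses the spawn start method (Windows, and macOS by default), worker processes re-import the main module, so the code that actually starts the scrape should live under the usual __main__ guard, roughly:

if __name__ == '__main__':
    # Placeholder URLs; the guard prevents spawned worker processes
    # from re-running the scrape when they import this module.
    urls = ['https://example.com/post-1', 'https://example.com/post-2']
    results = asyncio.run(scrape_urls(urls))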

Repeatedly download websites with Python async

I have a list of URLs of websites that I want to download repeatedly (at variable time intervals) using Python. It is necessary to do this asynchronously to cope with a large number of websites and/or long response times.
I've tried many things with event loops, queues, async functions, asyncio, etc., but I cannot get it working. The following very simple version downloads the websites repeatedly, but it does not download them concurrently - instead, the next download only starts after the previous one has finished.
import asyncio
import datetime
import aiohttp

def produce_helper(url: str):
    # helper, because I cannot call an async function with loop.call_later
    loop.create_task(produce(url))

async def produce(url: str):
    await q.put(url)
    print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Produced {url}')

async def consume():
    async with aiohttp.ClientSession() as session:
        while True:
            url = await q.get()
            print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Start: {url}')
            async with session.get(url, timeout=10) as response:
                print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Finished: {url}')
            q.task_done()
            loop.call_later(10, produce_helper, url)

q = asyncio.Queue()
url_list = ["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]
loop = asyncio.get_event_loop()
for url in url_list:
    loop.create_task(produce(url))
loop.create_task(consume())
loop.run_forever()
Is this a suitable approach for my problem? Is there anything better conceptually?
And how do I accomplish concurrent downloads?
Any help is appreciated.
EDIT:
The challenge (as described in the comment below) is the following: after each successful download, I want to add the respective URL back to the queue, to become due after a specified waiting time (10 s in the example in my question). As soon as it is due, I want to download the website again, add the URL back to the queue, and so on.
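No answer is included for this question here, so the following is only a sketch of one way to get concurrent downloads with this producer/consumer design, not an accepted solution. The single consume() loop above awaits each response before taking the next URL, so only one download runs at a time; handing each download to its own task and re-enqueueing the URL after an asyncio.sleep delay (instead of loop.call_later plus a helper) lets downloads overlap. The helper name fetch_and_requeue and the constant RETRY_DELAY are illustrative:

import asyncio
import datetime
import aiohttp

RETRY_DELAY = 10  # seconds to wait before a URL becomes due again
background_tasks = set()

def now() -> str:
    return datetime.datetime.now().strftime("%H:%M:%S.%f")

async def fetch_and_requeue(session, q, url):
    print(f'{now()} - Start: {url}')
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
        await response.read()
    print(f'{now()} - Finished: {url}')
    await asyncio.sleep(RETRY_DELAY)  # wait until the URL is due again
    await q.put(url)                  # re-enqueue it for the next round

async def consume(session, q):
    while True:
        url = await q.get()
        # Hand the download off to its own task so the consumer can
        # immediately pick up the next URL; keep a reference so the
        # task is not garbage-collected before it finishes.
        task = asyncio.create_task(fetch_and_requeue(session, q, url))
        background_tasks.add(task)
        task.add_done_callback(background_tasks.discard)
        q.task_done()

async def main(urls):
    q = asyncio.Queue()
    for url in urls:
        await q.put(url)
    async with aiohttp.ClientSession() as session:
        await consume(session, q)  # runs forever

if __name__ == '__main__':
    url_list = ["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]
    asyncio.run(main(url_list))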
