I am developing a crawler and I would like to utilize asyncio to crawl links asynchronously, in order to improve performance.
I have already used Celery in my synchronous crawler, which allows me to run several crawlers in parallel. However, the crawler itself is synchronous, so the performance is poor. I have redesigned my code using asyncio and defined the new crawler as a Celery task, but the crawler is getting network errors for no apparent reason, which makes me wonder: is it even possible to use a coroutine as a Celery task?
The answer to that question would help me figure out whether there is a problem with my code or an incompatibility in the stack I'm using.
async def url_helper(url):
    return url.html.absolute_links

async def url_worker(session, queue, urls, request_count):
    while True:
        # Get a "work item" out of the queue.
        try:
            current = await queue.get()
        except asyncio.QueueEmpty:
            return
        # Stay in domain
        if not config.regex_url.match(current):
            # Notify the queue that the "work item" has been processed.
            queue.task_done()
            return
        # Check against the desired level of depth
        depth = urls.get(current, 0)
        if depth == config.max_depth:
            logger.info("Out of depth: {}".format(current))
            # Notify the queue that the "work item" has been processed.
            queue.task_done()
            return
        # Get all URLs from the page
        try:
            logger.info("current link: {}".format(current))
            resp = await session.get(
                current, allow_redirects=True, timeout=15
            )
            new_urls = await url_helper(resp)
            request_count += 1
        except TimeoutError as e:
            logger.exception(e)
            # Notify the queue that the "work item" has been processed.
            queue.task_done()
            return
        except OSError as e:
            logger.exception(e)
            # Notify the queue that the "work item" has been processed.
            queue.task_done()
            return
        # Add URLs to queue if internal and not already in urls
        for link in new_urls:
            if config.regex_url.match(link):
                if link not in urls:
                    urls[link] = depth + 1
                    await queue.put(link)
        # Notify the queue that the "work item" has been processed.
        queue.task_done()
async def url_crawler(url):
    """
    Crawls.
    """
    # Set time
    start = time.time()
    http_asession = requests_html.AsyncHTMLSession()
    logger.debug("Sending GET request to start_url...")
    start_response = await http_asession.get(url=url)
    logger.debug("Received {} response.".format(start_response.status_code))
    request_count = 1
    # Dict to hold all URLs on the website
    urls = {}
    # List containing all URLs from the start page
    start_urls = list(start_response.html.absolute_links)
    logger.info("start urls: %s", start_urls)
    # A queue to store our to-be-visited urls
    queue = asyncio.Queue()
    # Max concurrency
    max_workers = config.max_concurrency
    # Put urls from start page in the queue
    for url in start_urls:
        await queue.put(url)
    # Create worker tasks to process the queue concurrently.
    coros = asyncio.gather(
        *[
            url_worker(
                queue=queue,
                session=http_asession,
                urls=urls,
                request_count=request_count,
            )
            for i in range(max_workers)
        ]
    )
    await coros
    await http_asession.close()
    logger.info("Crawled {} links.".format(len(urls)))
    logger.info(urls)
    logger.debug("Made {} HTTP requests.".format(request_count))
    finish = time.time()
    logger.info("Execution time: {}".format(finish - start))

@celery_app.task
def run_crawler(url):
    loop = asyncio.get_event_loop()
    loop.run_until_complete(url_crawler(url))
    loop.close()
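As far as I know, Celery (at least through Celery 5) does not natively execute coroutine task functions; the usual workaround is to register a synchronous task that drives the coroutine to completion with asyncio.run(). A minimal sketch of that pattern (the celery_app and url_crawler names are assumptions taken from the question; demo_crawl is a hypothetical stand-in):

```python
import asyncio

def run_coroutine(coro_func, *args):
    """Run a coroutine function to completion from synchronous code."""
    # asyncio.run() creates a fresh event loop, runs the coroutine,
    # and closes the loop -- safer inside long-lived worker processes
    # than managing get_event_loop()/close() by hand.
    return asyncio.run(coro_func(*args))

# With Celery this would look roughly like:
# @celery_app.task
# def run_crawler(url):
#     return run_coroutine(url_crawler, url)

async def demo_crawl(url):
    # stand-in for the real crawling work
    await asyncio.sleep(0)
    return f"crawled {url}"

print(run_coroutine(demo_crawl, "https://example.com"))
```

The key point is that each task invocation gets its own fresh event loop, which avoids reusing a closed loop across task runs inside a worker process.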
Related
I have some HTML pages from which I am trying to extract the text using asynchronous web requests through aiohttp and asyncio; after extracting the text, I save the files locally. I am using BeautifulSoup (in extract_text()) to process the text from the response and extract the relevant text within the HTML page (excluding the code, etc.), but I am facing an issue where the synchronous version of my script is faster than my asynchronous + multiprocessing version.
As I understand it, using the BeautifulSoup function causes the main event loop to block within parse(), so based on these two StackOverflow questions [0, 1], I figured the best thing to do was to run extract_text() within its own process (as it's a CPU-bound task), which should prevent the event loop from blocking.
This results in the script taking 1.5x longer than the synchronous version (with no multiprocessing).
To confirm that this was not an issue with my implementation of the asynchronous code, I removed the use of extract_text() and instead saved the raw text from the response object. Doing this resulted in my asynchronous code being much faster, showing that the issue is purely from extract_text() being run in a separate process.
Am I missing some important detail here?
import asyncio
from asyncio import Semaphore
import json
import logging
from pathlib import Path
from typing import List, Optional

import aiofiles
from aiohttp import ClientSession
import aiohttp
from bs4 import BeautifulSoup
import concurrent.futures
import functools

def extract_text(raw_text: str) -> str:
    return " ".join(BeautifulSoup(raw_text, "html.parser").stripped_strings)

async def fetch_text(
    url: str,
    session: ClientSession,
    semaphore: Semaphore,
    **kwargs: dict,
) -> str:
    async with semaphore:
        response = await session.request(method="GET", url=url, **kwargs)
        response.raise_for_status()
        logging.info("Got response [%s] for URL: %s", response.status, url)
        text = await response.text(encoding="utf-8")
        return text

async def parse(
    url: str,
    session: ClientSession,
    semaphore: Semaphore,
    **kwargs,
) -> Optional[str]:
    try:
        text = await fetch_text(
            url=url,
            session=session,
            semaphore=semaphore,
            **kwargs,
        )
    except (
        aiohttp.ClientError,
        aiohttp.http_exceptions.HttpProcessingError,
    ) as e:
        logging.error(
            "aiohttp exception for %s [%s]: %s",
            url,
            getattr(e, "status", None),
            getattr(e, "message", None),
        )
    except Exception as e:
        logging.exception(
            "Non-aiohttp exception occured: %s",
            getattr(e, "__dict__", None),
        )
    else:
        loop = asyncio.get_running_loop()
        with concurrent.futures.ProcessPoolExecutor() as pool:
            extract_text_ = functools.partial(extract_text, text)
            text = await loop.run_in_executor(pool, extract_text_)
        logging.info("Found text for %s", url)
        return text
async def process_file(
    url: dict,
    session: ClientSession,
    semaphore: Semaphore,
    **kwargs: dict,
) -> None:
    category = url.get("category")
    link = url.get("link")
    if category and link:
        text = await parse(
            url=f"{URL}/{link}",
            session=session,
            semaphore=semaphore,
            **kwargs,
        )
        if text:
            save_path = await get_save_path(
                link=link,
                category=category,
            )
            await write_file(html_text=text, path=save_path)
        else:
            logging.warning("Text for %s not found, skipping it...", link)

async def process_files(
    html_files: List[dict],
    semaphore: Semaphore,
) -> None:
    async with ClientSession() as session:
        tasks = [
            process_file(
                url=file,
                session=session,
                semaphore=semaphore,
            )
            for file in html_files
        ]
        await asyncio.gather(*tasks)

async def write_file(
    html_text: str,
    path: Path,
) -> None:
    # Write to file using aiofiles
    ...

async def get_save_path(link: str, category: str) -> Path:
    # return path to save
    ...

async def main_async(
    num_files: Optional[int],
    semaphore_count: int,
) -> None:
    html_files = # get all the files to process
    semaphore = Semaphore(semaphore_count)
    await process_files(
        html_files=html_files,
        semaphore=semaphore,
    )

if __name__ == "__main__":
    NUM_FILES = # passed through CLI args
    SEMAPHORE_COUNT = # passed through CLI args
    asyncio.run(
        main_async(
            num_files=NUM_FILES,
            semaphore_count=SEMAPHORE_COUNT,
        )
    )
SnakeViz charts across 1000 samples
Async version with extract_text and multiprocessing
Async version without extract_text
Sync version with extract_text (notice how the html_parser from BeautifulSoup takes up the majority of the time here)
Sync version without extract_text
Here is roughly what your asynchronous program does:
Launch num_files parse() tasks concurrently
Each parse() task creates its own ProcessPoolExecutor and asynchronously awaits extract_text (which is executed in the freshly created process pool).
This is suboptimal for several reasons:
It creates num_files process pools, which are expensive to create and take memory
Each pool is used for only a single operation, which is counterproductive: as many concurrent operations as possible should be submitted to a given pool
You are creating a new ProcessPoolExecutor each time the parse() function is called. You could instead instantiate it once (as a global, for instance, or passed through a function argument):
from concurrent.futures import ProcessPoolExecutor

async def parse(loop, executor, ...):
    ...
    text = await loop.run_in_executor(executor, extract_text)

# and then in `process_file` (or `process_files`):

async def process_file(...):
    ...
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as executor:
        ...
        await process(loop, executor, ...)
I benchmarked the overhead of creating a ProcessPoolExecutor on my old MacBook Air 2015 and it shows that it is quite slow (almost 100 ms for pool creation, opening, submit and shutdown):
from time import perf_counter
from concurrent.futures import ProcessPoolExecutor

def noop():
    # lambdas cannot be pickled for a process pool, so use a module-level no-op
    pass

def main_1():
    """Pool created once"""
    reps = 100
    t1 = perf_counter()
    with ProcessPoolExecutor() as executor:
        for _ in range(reps):
            executor.submit(noop)
    t2 = perf_counter()
    print(f"{(t2 - t1) / reps * 1_000} ms")  # 2 ms/it

def main_2():
    """Pool created at each iteration"""
    reps = 100
    t1 = perf_counter()
    for _ in range(reps):
        with ProcessPoolExecutor() as executor:
            executor.submit(noop)
    t2 = perf_counter()
    print(f"{(t2 - t1) / reps * 1_000} ms")  # 100 ms/it

if __name__ == "__main__":
    main_1()
    main_2()
You may again hoist it up into the process_files function, which avoids recreating the pool for each file.
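A rough sketch of that hoisting, with simplified stand-ins for the question's functions (the string-stripping extract_text here is a placeholder for the real BeautifulSoup-based one, and the signatures are trimmed down):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def extract_text(raw_text: str) -> str:
    # placeholder for the BeautifulSoup-based extraction; it must be a
    # module-level function so it can be pickled for the process pool
    return raw_text.strip()

async def parse(text: str, executor: ProcessPoolExecutor) -> str:
    loop = asyncio.get_running_loop()
    # CPU-bound work runs in the shared pool; the event loop stays free
    return await loop.run_in_executor(executor, extract_text, text)

async def process_files(texts):
    # one pool for the whole batch, created and torn down exactly once
    with ProcessPoolExecutor() as executor:
        return await asyncio.gather(*(parse(t, executor) for t in texts))

if __name__ == "__main__":
    print(asyncio.run(process_files(["  a  ", "  b  "])))
```

Every parse() call now borrows the same pool instead of paying the ~100 ms creation cost per file.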
Also, try to inspect your first SnakeViz chart more closely in order to know what exactly in process.py:submit is taking so much time.
One last thing, be careful of the semantics of using a context manager on an executor:
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    for i in range(100):
        executor.submit(some_work, i)
Not only does this create an executor and submit work to it, it also waits for all work to finish before exiting the with statement.
The question is quite simple: is it possible to test a list of URLs and store in a list only the dead URLs (response code > 400) using asynchronous functions?
I previously used the requests library to do it and it worked great, but I have a big list of URLs to test, and doing it sequentially takes more than 1 hour.
I saw a lot of articles on how to make parallel requests using asyncio and aiohttp, but I didn't see much about how to test URLs with these libraries.
Is it possible to do it?
Using multithreading you could do it like this:
import requests
from concurrent.futures import ThreadPoolExecutor

results = dict()

# test the given url
# add url and status code to the results dictionary if GET succeeds but status code >= 400
# also add url to results dictionary if an exception arises with full exception details
def test_url(url):
    try:
        r = requests.get(url)
        if r.status_code >= 400:
            results[url] = f'{r.status_code=}'
    except requests.exceptions.RequestException as e:
        results[url] = str(e)

# return a list of URLs to be checked. probably get these from a file in reality
def get_list_of_urls():
    return ['https://facebook.com', 'https://google.com', 'http://google.com/nonsense', 'http://goooglyeyes.org']

def main():
    with ThreadPoolExecutor() as executor:
        executor.map(test_url, get_list_of_urls())
    print(results)

if __name__ == '__main__':
    main()
You could do something like this using aiohttp and asyncio. It could be done more pythonically, I guess, but it should work.
import aiohttp
import asyncio

urls = ['url1', 'url2']

async def test_url(session, url):
    async with session.get(url) as resp:
        if resp.status > 400:
            return url

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(asyncio.ensure_future(test_url(session, url)))
        dead_urls = await asyncio.gather(*tasks)
        print(dead_urls)

asyncio.run(main())
Very basic example, but this is how I would solve it:
from aiohttp import ClientSession
from asyncio import create_task, gather, run

async def TestUrl(url, session):
    async with session.get(url) as response:
        if response.status >= 400:
            r = await response.text()
            print(f"Site: {url} is dead, response code: {str(response.status)} response text: {r}")

async def TestUrls(urls):
    resultsList: list = []
    async with ClientSession() as session:
        # Maybe some rate limiting?
        partitionTasks: list = [
            create_task(TestUrl(url, session))
            for url in urls]
        resultsList.append(await gather(*partitionTasks, return_exceptions=False))
    # do stuff with the results or return?
    return resultsList

async def main():
    urls = []
    test = await TestUrls(urls)

if __name__ == "__main__":
    run(main())
Try using a ThreadPoolExecutor
from concurrent.futures import ThreadPoolExecutor
import requests

url_list = [
    "https://www.google.com",
    "https://www.adsadasdad.com",
    "https://www.14fsdfsff.com",
    "https://www.ggr723tg.com",
    "https://www.yyyyyyyyyyyyyyy.com",
    "https://www.78sdf8sf5sf45sf.com",
    "https://www.wikipedia.com",
    "https://www.464dfgdfg235345.com",
    "https://www.tttllldjfh.com",
    "https://www.qqqqqqqqqq456.com"
]

def check(url):
    r = requests.get(url)
    if r.status_code < 400:
        print(f"{url} is ALIVE")

with ThreadPoolExecutor(max_workers=5) as e:
    for url in url_list:
        e.submit(check, url)
Multiprocessing could be the better option for your problem.
from multiprocessing import Process
from multiprocessing import Manager
import requests

def checkURLStatus(url, url_status):
    res = requests.get(url)
    if res.status_code >= 400:
        url_status[url] = "Inactive"
    else:
        url_status[url] = "Active"

if __name__ == "__main__":
    urls = [
        "https://www.google.com"
    ]
    manager = Manager()
    # to store the results for later usage
    url_status = manager.dict()
    procs = []
    for url in urls:
        proc = Process(target=checkURLStatus, args=(url, url_status))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
    print(url_status.values())
url_status is a shared variable that stores data across the separate processes. Refer to the multiprocessing documentation for more info.
There's a game named guild wars 2 and it gives us APIs to query almost everything in the game database. My aim is using python asyncio and aiohttp to write a simple crawler and get all the items' info from guild wars 2 game database.
I wrote a short program; it works, but it behaves kind of weird, and I guess there's something I don't understand about composing coroutines.
First, I made a request with the Postman app, and the response header contains X-Rate-Limit-Limit: 600. So I guess requests are limited to 600 per minute?
Here are my questions.
1. After the program finished, I checked some of the JSON files and they have the same content:
[{"name": "Endless Fractal Challenge Mote Tonic", "description": "Transform into a Challenge Mote for 15 minutes or until hit. You cannot move while transformed."......
which means the request got a bad response, but I don't know why.
2. I tried asyncio.Semaphore, but even when I limit concurrency to 5, the requests go beyond 600 very soon. So I tried to control timing by adding a time.sleep(0.2) at the end of the request_item function. I guessed that time.sleep(0.2) would suspend the whole Python process for 0.2 seconds, and it actually worked, but after executing for some time the program hangs for a long while and then gives out a lot of failed attempts. Every automatic retry still failed. I'm confused by this behavior.
async def request_item(session, item_id):
    req_param_item = req_param
    req_param_item['ids'] = item_id
    # retry for 3 times when exception occurs.
    for i in range(3):
        try:
            async with session.get(url_template, params=req_param_item) as response:
                result = await response.json()
                with open(f'item_info/{item_id}.json', 'w') as f:
                    json.dump(result, f)
                print(item_id, 'done')
                break
        except Exception as e:
            print(item_id, i, 'failed')
            continue
    time.sleep(0.2)
When I move time.sleep(0.2) into the for loop inside the request_item function, the whole program hangs. I have no idea what is happening.
async def request_item(session, item_id):
    req_param_item = req_param
    req_param_item['ids'] = item_id
    for i in range(3):
        try:
            time.sleep(0.2)
            async with session.get(url_template, params=req_param_item) as response:
                result = await response.json()
                with open(f'item_info/{item_id}.json', 'w') as f:
                    json.dump(result, f)
                print(item_id, 'done')
                break
        except Exception as e:
            print(item_id, i, 'failed')
            continue
Could anyone explain this a little? And is there a better solution?
I have thought of some solutions, but I can't test them. For example: get loop.time() and suspend the whole event loop after every 600 requests. Or add 600 requests to task_list, gather them as a group, and after it's done, call asyncio.run(get_item(req_ids)) again with another 600 requests.
Here's all of my code.
import aiohttp
import asyncio
import httpx
import json
import math
import os
import time

tk = 'xxxxxxxx'
url_template = 'https://api.guildwars2.com/v2/items'

# get items list
req_param = {'access_token': tk}
item_list_resp = httpx.get(url_template, params=req_param)
items = item_list_resp.json()

async def request_item(session, item_id):
    req_param_item = req_param
    req_param_item['ids'] = item_id
    for i in range(3):
        try:
            async with session.get(url_template, params=req_param_item) as response:
                result = await response.json()
                with open(f'item_info/{item_id}.json', 'w') as f:
                    json.dump(result, f)
                print(item_id, 'done')
                break
        except Exception as e:
            print(item_id, i, 'failed')
            continue
    # since the game API limit requests, I think it's ok to suspend program for a while
    time.sleep(0.2)

async def get_item(item_ids: list):
    task_list = []
    async with aiohttp.ClientSession() as session:
        for item_id in item_ids:
            req = request_item(session, item_id)
            task = asyncio.create_task(req)
            task_list.append(task)
        await asyncio.gather(*task_list)

asyncio.run(get_item(req_ids))
You are using time.sleep() instead of await asyncio.sleep(). It blocks the whole execution for N seconds, and it does so in the wrong place.
Here is what happens. When you run
for item_id in item_ids:
    req = request_item(session, item_id)
    task = asyncio.create_task(req)
    task_list.append(task)
you just schedule your requests, but do not run them. Say you have 1000 item_ids: you schedule 1000 tasks, and when you run await asyncio.gather(*task_list) you actually wait for all 1000 of these tasks to be executed. They will all fire at once.
But inside each task you run time.sleep(0.2), which blocks the event loop, so you end up waiting 1000 * 0.2 seconds in total. Remember that all tasks run at once, and in general in a random order: task 1 runs and waits 0.2 sec, then task 2 fires and waits 0.2 sec, then task 999 fires and waits 0.2 sec, and so on.
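The difference is easy to demonstrate: await asyncio.sleep() suspends only the current task, so concurrent sleeps overlap, while time.sleep() blocks the whole event loop, so they serialize. A minimal illustration:

```python
import asyncio
import time

async def napper(i):
    # asyncio.sleep suspends only this task; the event loop
    # is free to run the other tasks while it waits
    await asyncio.sleep(0.1)
    return i

async def main():
    return await asyncio.gather(*(napper(i) for i in range(10)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
# ten overlapping 0.1 s sleeps take ~0.1 s total; with time.sleep(0.1)
# in their place the loop would block and this would take ~1 s
print(f"{len(results)} naps in {elapsed:.2f}s")
```

This is exactly why moving time.sleep(0.2) inside each task makes the program appear to hang: every blocking sleep stalls every other task.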
The simplest solution is to wait for a minute after firing 600 requests. You need to slow down inside get_item. Example code (I did not test it):
async def get_item(item_ids: list):
    task_list = []
    async with aiohttp.ClientSession() as session:
        for n, item_id in enumerate(item_ids, start=1):
            req = request_item(session, item_id)
            task = asyncio.create_task(req)
            task_list.append(task)
            if n % 600 == 0:
                await asyncio.gather(*task_list)
                await asyncio.sleep(60)
                task_list = []
        # wait for whatever is left over after the last full batch
        await asyncio.gather(*task_list)
I recommend using the asyncio-throttle library.
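If you would rather avoid a dependency, a minimal rate limiter can be built from an asyncio.Semaphore whose slots are released again after the period elapses. This is only a sketch of the idea, not the asyncio-throttle API; the tiny rate and period values are just to keep the demo fast:

```python
import asyncio
import time

class RateLimiter:
    """Allow at most `rate` acquisitions per rolling `period` seconds."""

    def __init__(self, rate: int, period: float = 60.0):
        self._sem = asyncio.Semaphore(rate)
        self._period = period

    async def __aenter__(self):
        await self._sem.acquire()
        # free this slot again once `period` has elapsed
        asyncio.get_running_loop().call_later(self._period, self._sem.release)

    async def __aexit__(self, *exc):
        return False

async def fetch(limiter, i):
    async with limiter:
        # the real session.get(...) would go here
        await asyncio.sleep(0)
        return i

async def main():
    # the crawler above would use RateLimiter(600, 60.0)
    limiter = RateLimiter(rate=2, period=0.2)
    return await asyncio.gather(*(fetch(limiter, i) for i in range(6)))

print(asyncio.run(main()))
```

Unlike time.sleep(), waiting on the semaphore only suspends the task that is over the limit; all other tasks keep running on the event loop.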
PS. With a rate limit of 600 per minute, I do not think you need asyncio, because I am pretty sure that 600 concurrent requests will be executed in 5-10 seconds. Double-check whether your 600 requests really take more than 1 minute with classic requests plus threads.
I am writing a get method that receives an array of ids and then makes a request for each id. The array can contain 500+ ids, and right now the requests take 20+ minutes. I have tried several different async approaches, like aiohttp and asyncio, and none of them have made the requests faster. Here is my code:
async def get(self):
    self.set_header("Access-Control-Allow-Origin", "*")
    story_list = []
    duplicates = []
    loop = asyncio.get_event_loop()
    ids = loop.run_in_executor(None, requests.get, 'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
    response = await ids
    response_data = response.json()
    print(response.text)
    for url in response_data:
        if url not in duplicates:
            duplicates.append(url)
            stories = loop.run_in_executor(None, requests.get, "https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(
                url))
            data = await stories
            if data.status_code == 200 and len(data.text) > 5:
                print(data.status_code)
                print(data.text)
                story_list.append(data.json())
Is there a way I can use multithreading to make the requests faster?
The main issue here is that the code isn't really async.
After getting your list of URLs, you then fetch them one at a time, awaiting each response before starting the next.
A better idea would be to filter out the duplicates (use a set) before queuing all of the URLs in the executor and awaiting all of them at once, e.g.:
async def get(self):
    self.set_header("Access-Control-Allow-Origin", "*")
    stories = []
    loop = asyncio.get_event_loop()
    # Single executor to share resources
    executor = ThreadPoolExecutor()
    # Get the initial set of ids
    response = await loop.run_in_executor(executor, requests.get, 'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
    response_data = response.json()
    print(response.text)
    # Putting them in a set will remove duplicates
    urls = set(response_data)
    # Build the set of futures (returned by run_in_executor) and wait for them all to complete
    responses = await asyncio.gather(*[
        loop.run_in_executor(
            executor, requests.get,
            "https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty".format(url)
        ) for url in urls
    ])
    # Process the responses
    for response in responses:
        if response.status_code == 200 and len(response.text) > 5:
            print(response.status_code)
            print(response.text)
            stories.append(response.json())
    return stories
I have the following code:
with ThreadPoolExecutor(max_workers=num_of_pages) as executor:
    futh = [(executor.submit(self.getdata2, page, hed, data, apifolder, additional)) for page in pages]
    for data in as_completed(futh):
        datarALL = datarALL + data.result()
    return datarALL
num_of_pages isn't fixed, but it's usually around 250.
The getdata2 function creates GET requests and returns each page's results.
The problem is that all 250 pages (threads) are created together, which means 250 GET requests are made at the same time. This causes overload on the server, so I get a lot of retries due to delayed server responses, which shut down the GET call and retry it. I want to avoid that.
I thought of creating some sort of lock which would prevent a thread/page from making the GET request if there are more than 10 active requests. In such a case it would wait until a slot is available.
Something like:
executing_now = []

def getdata2(...):
    ...
    while len(executing_now) > 10:
        sleep(10)
    executing_now.append(page)
    response = requests.get(url, data=data, headers=hed, verify=False)
    ....
    executing_now.remove(page)
    return ...
Is there an existing mechanism for this in Python? It requires the threads to check shared memory... I want to avoid multithreading problems such as deadlocks.
Basically, wrap the GET call with a limit on how many threads can execute it at the same time.
We can use a queue to "prepare" all your pages, and then you can limit your thread pool to any number of threads, since each thread will fetch the needed page from the queue:
# preparing here all your page objects
pages_queue = queue.Queue()
[pages_queue.put(page) for page in pages]

# ThreadPool - each thread will take one page from the queue, and when done, will fetch the next one
with ThreadPoolExecutor(max_workers=10) as executor:
    futh = [(executor.submit(self.getdata2, pages_queue, hed, data, apifolder, additional))]
    for data in as_completed(futh):
        datarALL = datarALL + data.result()
    return datarALL

def getdata2(...):
    ...
    try:
        while True:  # non blocking wait will raise Empty when queue is empty
            page = pages_queue.get_nowait()
            response = requests.get(page.url, data=data, headers=hed, verify=False)
            ....
    except queue.Empty:
        pass
    return ...
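Alternatively, the "lock" the question describes already exists in the standard library as threading.BoundedSemaphore: it caps how many threads can be inside the GET call at once, with no busy-wait loop and no manual shared list. A rough sketch, with a small sleep standing in for the real requests.get call (the active/peak counters are only there to show the cap holds):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 10
gate = threading.BoundedSemaphore(MAX_CONCURRENT)

# bookkeeping just to demonstrate the cap is respected
active = 0
peak = 0
counter_lock = threading.Lock()

def getdata2(page):
    global active, peak
    with gate:  # blocks until one of the 10 slots frees up
        with counter_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for requests.get(url, ...)
        result = page * 2
        with counter_lock:
            active -= 1
    return result

with ThreadPoolExecutor(max_workers=50) as executor:
    results = list(executor.map(getdata2, range(100)))

print(f"peak concurrency: {peak}")
```

Of course, if the only goal is limiting in-flight requests, simply constructing the pool with ThreadPoolExecutor(max_workers=10) achieves the same cap without any semaphore at all; the semaphore is useful when the pool must stay large but only one section of the work needs throttling.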