Given a list of 50k website URLs, I've been tasked with finding out which of them are up/reachable. The idea is just to send a HEAD request to each URL and look at the response status. From what I hear, an asynchronous approach is the way to go, and for now I'm using asyncio with aiohttp.
I came up with the following code, but the speed is pretty abysmal: 1000 URLs take approximately 200 seconds on my 10 Mbit connection. I don't know what speeds to expect, but I'm new to asynchronous programming in Python, so I figured I've stepped wrong somewhere. As you can see, I've tried increasing the number of allowed simultaneous connections to 1000 (up from the default of 100) and the duration for which DNS resolutions are kept in the cache; neither had any great effect. The environment has Python 3.6 and aiohttp 3.5.4.
Code review unrelated to the question is also appreciated.
import asyncio
import time
from socket import gaierror
from typing import List, Tuple
import aiohttp
from aiohttp.client_exceptions import TooManyRedirects
# Using a non-default user-agent seems to avoid lots of 403 (Forbidden) errors
HEADERS = {
'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/45.0.2454.101 Safari/537.36'),
}
async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
try:
# A HEAD request is quicker than a GET request
resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
async with resp:
status = resp.status
reason = resp.reason
if status == 405:
# HEAD request not allowed, fall back on GET
resp = await session.get(
url, allow_redirects=True, ssl=False, headers=HEADERS)
async with resp:
status = resp.status
reason = resp.reason
return (status, reason)
except aiohttp.InvalidURL as e:
return (900, str(e))
except aiohttp.ClientConnectorError:
return (901, "Unreachable")
except gaierror as e:
return (902, str(e))
except aiohttp.ServerDisconnectedError as e:
return (903, str(e))
except aiohttp.ClientOSError as e:
return (904, str(e))
except TooManyRedirects as e:
return (905, str(e))
except aiohttp.ClientResponseError as e:
return (906, str(e))
except aiohttp.ServerTimeoutError:
return (907, "Connection timeout")
except asyncio.TimeoutError:
return (908, "Connection timeout")
async def get_status_codes(loop: asyncio.events.AbstractEventLoop, urls: List[str],
timeout: int) -> List[Tuple[int, str]]:
conn = aiohttp.TCPConnector(limit=1000, ttl_dns_cache=300)
client_timeout = aiohttp.ClientTimeout(connect=timeout)
async with aiohttp.ClientSession(
loop=loop, timeout=client_timeout, connector=conn) as session:
codes = await asyncio.gather(*(get_status_code(session, url) for url in urls))
return codes
def poll_urls(urls: List[str], timeout=20) -> List[Tuple[int, str]]:
"""
:param timeout: in seconds
"""
print("Started polling")
time1 = time.time()
loop = asyncio.get_event_loop()
codes = loop.run_until_complete(get_status_codes(loop, urls, timeout))
time2 = time.time()
dt = time2 - time1
print(f"Polled {len(urls)} websites in {dt:.1f} seconds "
f"at {len(urls)/dt:.3f} URLs/sec")
return codes
Right now you're launching all your requests at once, so the bottleneck probably appears somewhere around that. To avoid this situation, a semaphore can be used:
# code
sem = asyncio.Semaphore(200)
async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
try:
async with sem:
resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
# code
I tested it the following way:
poll_urls([
'http://httpbin.org/delay/1'
for _
in range(2000)
])
And got:
Started polling
Polled 2000 websites in 13.2 seconds at 151.300 URLs/sec
Although it requests a single host, it shows that the asynchronous approach does the job: 13 sec < 2000 sec.
Several more things can be done:
You should play with the semaphore value to achieve better performance for your concrete environment and task.
Try lowering the timeout from 20 to, let's say, 5 seconds: since you're just doing a HEAD request, it shouldn't take much time. If a request hangs for 5 seconds, there's a good chance it won't succeed at all.
Monitoring your system resources (network/CPU/RAM) while the script is running can help you find out whether a bottleneck is still present.
By the way, did you install aiodns (as the docs suggest)?
Does disabling ssl change anything?
Try enabling debug-level logging to see if there is any useful info there.
Try setting up client tracing and, in particular, measure the time of each request step to see which ones take the most time (see the sketch below).
It's difficult to say more without a fully reproducible situation.
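To make the client-tracing suggestion concrete, here is a minimal sketch (not part of the original code) that uses aiohttp's TraceConfig hooks to time DNS resolution, connection creation, and the whole request. The attribute names stored on the trace context are placeholders chosen for this example; the httpbin URL is the same test endpoint used above.
import asyncio
import time
import aiohttp

# Record per-request timings on the trace context (a SimpleNamespace by default).
async def on_request_start(session, ctx, params):
    ctx.start = time.monotonic()

async def on_dns_start(session, ctx, params):
    ctx.dns_start = time.monotonic()

async def on_dns_end(session, ctx, params):
    ctx.dns_time = time.monotonic() - ctx.dns_start

async def on_conn_start(session, ctx, params):
    ctx.conn_start = time.monotonic()

async def on_conn_end(session, ctx, params):
    ctx.conn_time = time.monotonic() - ctx.conn_start

async def on_request_end(session, ctx, params):
    total = time.monotonic() - ctx.start
    print(f"{params.url}: dns={getattr(ctx, 'dns_time', 0.0):.3f}s "
          f"connect={getattr(ctx, 'conn_time', 0.0):.3f}s total={total:.3f}s")

def make_trace_config() -> aiohttp.TraceConfig:
    tc = aiohttp.TraceConfig()
    tc.on_request_start.append(on_request_start)
    tc.on_dns_resolvehost_start.append(on_dns_start)
    tc.on_dns_resolvehost_end.append(on_dns_end)
    tc.on_connection_create_start.append(on_conn_start)
    tc.on_connection_create_end.append(on_conn_end)
    tc.on_request_end.append(on_request_end)
    return tc

async def traced_head(url: str) -> int:
    async with aiohttp.ClientSession(trace_configs=[make_trace_config()]) as session:
        async with session.head(url, allow_redirects=True, ssl=False) as resp:
            return resp.status

loop = asyncio.get_event_loop()
print(loop.run_until_complete(traced_head("http://httpbin.org/delay/1")))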
Related
I'm trying to send HTTPS requests as quickly as possible. I know this would have to be done with concurrent requests, since my goal is 150 to 500+ requests a second. I've searched everywhere, but can't find an answer for Python 3.11+ or one that doesn't give me errors. I'm trying to avoid AIOHTTP, as the rigmarole of setting it up was a pain and it didn't even work.
The input should be an array of URLs and the output an array of HTML strings.
It's quite unfortunate that you couldn't set up AIOHTTP properly, because it is one of the most efficient ways to do asynchronous requests in Python.
Setup is not that hard:
import asyncio
import aiohttp
from time import perf_counter
def urls(n_reqs: int):
for _ in range(n_reqs):
yield "https://python.org"
async def get(session: aiohttp.ClientSession, url: str):
async with session.get(url) as response:
_ = await response.text()
async def main(n_reqs: int):
async with aiohttp.ClientSession() as session:
await asyncio.gather(
*[get(session, url) for url in urls(n_reqs)]
)
if __name__ == "__main__":
n_reqs = 10_000
start = perf_counter()
asyncio.run(main(n_reqs))
end = perf_counter()
print(f"{n_reqs / (end - start)} req/s")
You basically need to create a single ClientSession, which you then reuse to send the GET requests. The requests are made concurrently with asyncio.gather(). You could also use the newer asyncio.TaskGroup:
async def main(n_reqs: int):
async with aiohttp.ClientSession() as session:
async with asyncio.TaskGroup() as group:
for url in urls(n_reqs):
group.create_task(get(session, url))
This easily achieves 500+ requests per second on my 7+ year old dual-core computer. Contrary to what other answers suggested, this solution does not require spawning thousands of threads, which are expensive.
You may improve the speed even more by using a custom connector in order to allow more concurrent connections (the default is 100) in a single session:
async def main(n_reqs: int):
    connector = aiohttp.TCPConnector(limit=0)  # limit=0 removes the cap on simultaneous connections
async with aiohttp.ClientSession(connector=connector) as session:
...
Hope this helps; this question asked what the fastest way to send 10000 HTTP requests is.
I observed 15000 requests in 10 s, using Wireshark to capture on localhost; I saved the packets to CSV and counted only the packets that contained GET.
FILE: a.py
from treq import get
from twisted.internet import reactor
def done(response):
if response.code == 200:
get("http://localhost:3000").addCallback(done)
get("http://localhost:3000").addCallback(done)
reactor.callLater(10, reactor.stop)
reactor.run()
Run the test like this:
pip3 install treq
python3 a.py # code from above
Set up the test website like this; mine was on port 3000:
mkdir myapp
cd myapp
npm init
npm install express
node app.js
FILE: app.js
const express = require('express')
const app = express()
const port = 3000
app.get('/', (req, res) => {
res.send('Hello World!')
})
app.listen(port, () => {
console.log(`Example app listening on port ${port}`)
})
OUTPUT
grep GET wireshark.csv | head
"5","0.000418","::1","::1","HTTP","139","GET / HTTP/1.1 "
"13","0.002334","::1","::1","HTTP","139","GET / HTTP/1.1 "
"17","0.003236","::1","::1","HTTP","139","GET / HTTP/1.1 "
"21","0.004018","::1","::1","HTTP","139","GET / HTTP/1.1 "
"25","0.004803","::1","::1","HTTP","139","GET / HTTP/1.1 "
grep GET wireshark.csv | tail
"62145","9.994184","::1","::1","HTTP","139","GET / HTTP/1.1 "
"62149","9.995102","::1","::1","HTTP","139","GET / HTTP/1.1 "
"62153","9.995860","::1","::1","HTTP","139","GET / HTTP/1.1 "
"62157","9.996616","::1","::1","HTTP","139","GET / HTTP/1.1 "
"62161","9.997307","::1","::1","HTTP","139","GET / HTTP/1.1 "
This works, getting around 250+ requests a second.
This solution works on Windows 10. You may have to pip install requests (concurrent.futures is part of the standard library).
import time
import requests
import concurrent.futures
start = int(time.time()) # get time before the requests are sent
urls = [] # input URLs/IPs array
responses = [] # output content of each request as string in an array
# create a list of 5000 sites to test with
for y in range(5000):
    urls.append("https://example.com")

def send(url):
    responses.append(requests.get(url).content)

with concurrent.futures.ThreadPoolExecutor(max_workers=10000) as executor:
    futures = []
    for url in urls:
        futures.append(executor.submit(send, url))
end = int(time.time()) # get time after stuff finishes
print(str(round(len(urls)/(end - start),0))+"/sec") # get average requests per second
Output:
286.0/sec
Note: If your code requires something extremely time dependent (i.e., you want to handle each response as soon as it completes), have send return the content instead of appending it, and collect the results with as_completed:
def send(url):
    return requests.get(url).content

with concurrent.futures.ThreadPoolExecutor(max_workers=10000) as executor:
    futures = []
    for url in urls:
        futures.append(executor.submit(send, url))
    for future in concurrent.futures.as_completed(futures):
        responses.append(future.result())
This is a modified version of what this site showed in an example.
The secret sauce is the max_workers=10000; otherwise, it would average about 80/sec. Setting it beyond 1000, though, didn't give any further boost in speed.
I have just run into a funny situation when testing my FastAPI Python application, and I thought it might be useful for people who reuse sessions in their apps and want to test requests against the same app, but get stuck on weird errors like the one in the title.
I would also like to know what is happening here.
Context
I have an async FastAPI application that schedules multiple requests based on an unimportant configuration. After the list of request definitions is prepared, a session is created and the requests are sent, possibly with delays so I can spread them out in time.
To test whether the requests are getting through, I have created routes in my own app so I can send the testing requests back to my own application. The application basically talks to itself.
It was listening on 127.0.0.1:8000 at the time of testing.
I have the following functions defined for building async tasks:
def optional_session(func):
async def wrapper(*args, **kwargs):
if 'session' not in kwargs or kwargs['session'] is None:
async with ClientSession() as session:
kwargs['session'] = session
return await func(*args, **kwargs)
else:
return await func(*args, **kwargs)
return wrapper
@optional_session
async def post_json_with_time_from_url(url: str, data: dict, session: ClientSession = None) -> Tuple[Union[dict, None], float]:
"""
A method that performs a request to a specified URL and reads the response as JSON data.
If the request is successful the data is returned. If an error occurs it is logged and the returned data is None.
    :param data: data to send in the request
:param url: The URL to retrieve the image from
:return: A valid response or None
:param session:
"""
result = None, time.time()
try:
async with session.post(url, data=data) as response: # type: ClientResponse
# check if the response is valid
if response.status == 200:
try:
# we have to read the response before leaving the response context manager
result = await response.json(), time.time()
except Exception as e:
logger.error("...")
else:
logger.error(
"...")
except InvalidURL as e:
logger.error(f"...")
except Exception as e:
logger.error("...")
return result
def delay(func, seconds: int):
""""
This decorator adds a time delay to an async function.
"""
if seconds is None:
seconds = 0
async def wrapper(*args, **kwargs):
await asyncio.sleep(seconds)
return await func(*args, **kwargs)
return wrapper
def parse_get_post_request(config: ConfigContext, session: aiohttp.ClientSession = None) -> asyncio.Task:
"""
Parses the get/post request from the configuration dictionary and creates an async task for it.
"""
request_type = config.extract_key('request_type', True).lower()
delay_ = config.extract_key('delay')
url_base_ = config.extract_key('request_url_base', True)
url_suffix_ = config.extract_key('request_url_suffix', True)
url_ = urljoin(base=url_base_, url=url_suffix_)
if request_type == 'get':
return asyncio.ensure_future(
delay(get_json_with_time_from_url, delay_)(url=url_, session=session)
)
elif request_type == 'post':
return asyncio.ensure_future(
delay(post_json_with_time_from_url, delay_)(url=url_, session=session, data=config.extract_key('request_data'))
)
else:
raise ValueError(f"Unsupported request type: {request_type}")
I am creating an aiohttp session like this:
async with aiohttp.ClientSession() as session:
...
and then reusing it throughout the context code block, something like this:
single_request_tasks = []
...
for config in configs:
single_request_tasks.append(parse_get_post_request(config=plan_config, session=session))
...
responses = await asyncio.gather(*single_request_tasks)
...
Problem
Somehow, when I send the requests all at once and one of them arrives back at the app at the same time as another one, an exception is thrown:
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
It turns out that, for some reason, the session I share for all the requests is terminated when multiple requests arrive at the same time using the same ClientSession instance.
I am not really sure why exactly this happens, apart from suspecting some port-clash shenanigans, but it is resolved when I use a separate session for each request, or when I spread the requests out in time with an interval of, for example, one second.
Workaround
I have used separate sessions for each request when looping back to localhost.
I also avoided the issue when I spread the requests out in time so that each one has time to complete before the next one is sent, but timing is not that reliable a mechanism (OS task scheduling, concurrency in asyncio, network latency, etc.).
This problem does not occur when sharing a session with a different host (for example when scraping images from imgur.com) so I believe the problem is related to the fact that I am looping back to the localhost.
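As a minimal sketch of that workaround (building on the snippets above: it assumes the optional_session decorator, which opens a fresh ClientSession whenever session is None, and the same parse_get_post_request and configs objects), the change amounts to not passing the shared session when the requests loop back to localhost:
# Hypothetical sketch, not the exact production code: pass session=None so that
# optional_session creates a dedicated ClientSession for every request.
single_request_tasks = []
for config in configs:
    single_request_tasks.append(parse_get_post_request(config=config, session=None))
responses = await asyncio.gather(*single_request_tasks)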
Question
Why exactly does this happen? Why is the session closed by the software in the situation I described?
Is there anything I am doing wrong with the session? How does Starlette handle loopback connections? Is this case-dependent, and do I need to do more detective work, or is this generally recognized, platform-independent behaviour?
I am working with an API that limits to 4 requests/s. I am using asyncio and aiohttp to make asynchronous http requests. I am using Windows.
When working with the API I receive 3 status codes most commonly, 200, 400 & 429.
The 4/s limit works entirely fine when I'm seeing many 400s, but as soon as I receive a 200 and try to resolve the JSON using response.json, I begin to receive 429s for too many requests. I am trying to understand why something like this would occur.
EDIT: After doing some logging, I am seeing that the 429s appear to creep up after I have a response that takes longer than a second to complete (in the case of a 200, a large JSON response might take a bit of time to resolve). It appears that after a > 1s request elapsed time request occurs, followed by a fast one (which takes a few ms) the requests sort of "jump" ahead too quickly and overwhelm the API with more than 4 requests resolving in a second.
I am utilizing a semaphore of size 3 (trying 4 hits 429 way more often). The workflow is generally:
1. Create event loop
2. Gather tasks
3. Create http session and begin async requests with our Semaphore.
4. _fetch() is handling the specific asynchronous requests.
I am trying to understand why, when I receive 200s (which require some JSON serialization and likely add some latency), I am still able to hit rate limits even though I am always awaiting a sleep call of 1.5 seconds per call. Is this the fault of the API I am hitting, or is there something intrinsically wrong with my async/await calls?
Below is my code:
import asyncio
import aiohttp
import time
class Request:
def __init__(self, url: str, method: str="get", payload: str=None):
self.url: str = url
self.method: str = method
self.payload: str or dict = payload or dict()
class Response:
def __init__(self, url: str, status: int, payload: dict=None, error: bool=False, text: str=None):
self.url: str = url
self.status: int = status
self.payload: dict = payload or dict()
self.error: bool = error
self.text: str = text or ''
def make_requests(headers: dict, requests: list[Request]) -> asyncio.AbstractEventLoop:
"""
requests is a list with data necessary to make requests
"""
loop: asyncio.AbstractEventLoop = asyncio.get_event_loop()
responses: asyncio.AbstractEventLoop = loop.run_until_complete(_run(headers, requests))
return responses
async def _run(headers: dict, requests: list[Request]) -> "list[Response]":
# Create a semaphore to limit how many concurrent thread processes we can run (MAXIMUM) at a time.
semaphore: asyncio.Semaphore = asyncio.Semaphore(3)
time.sleep(10) # wait 10 seconds before beginning our async requests
async with aiohttp.ClientSession(headers=headers) as session:
tasks: list[asyncio.Task] = [asyncio.create_task(_iterate(semaphore, session, request)) for request in requests]
responses: list[Response] = await asyncio.gather(*tasks)
return responses
async def _iterate(semaphore: asyncio.Semaphore, session: aiohttp.ClientSession, request: Request) -> Response:
async with semaphore:
return await _fetch(session, request)
async def _fetch(session: aiohttp.ClientSession, request: Request) -> Response:
try:
async with session.request(request.method, request.url, params=request.payload) as response:
print(f"NOW: {time.time()}")
print(f"Response Status: {response.status}.")
content: dict = await response.json()
response.raise_for_status()
await asyncio.sleep(1.5)
return Response(request.url, response.status, payload=content, error=False)
except aiohttp.ClientResponseError:
if response.status == 429:
await asyncio.sleep(12) # Back off before proceeding with more requests
return await _fetch(session, request)
else:
await asyncio.sleep(1.5)
return Response(request.url, response.status, error=True)
The 4/s limit works entirely fine when seeing many 400s, but as soon as I receive a 200 and try to resolve the JSON using response.json, I begin to receive 429s for too many requests. I am trying to understand why something like this would occur.
The response status does not depend on how often you call the .json method on responses. The cause may be the security policy of the server the API is running on. While debugging, I optimized make_requests to make it more readable.
import asyncio
import aiohttp
class Request:
def __init__(self, url: str, method: str = "get", payload: str = None):
self.url: str = url
self.method: str = method
self.payload: str or dict = payload or dict()
class Response:
def __init__(self, url: str, status: int, payload: dict = None, error: bool = False, text: str = None):
self.url: str = url
self.status: int = status
self.payload: dict = payload or dict()
self.error: bool = error
self.text: str = text or ''
async def make_requests(headers: dict, requests: "list[Request]"):
"""
This function makes concurrent requests with a semaphore.
:param headers: Main HTTP headers to use in the session.
:param requests: A list of Request objects.
:return: List of responses converted to Response objects.
"""
async def make_request(request: Request) -> Response:
"""
This closure makes limited requests at the time.
:param request: An instance of Request that describes HTTP request.
:return: A processed response.
"""
async with semaphore:
try:
response = await session.request(request.method, request.url, params=request.payload)
content = await response.json()
response.raise_for_status()
return Response(request.url, response.status, payload=content, error=False)
except (aiohttp.ClientResponseError, aiohttp.ContentTypeError, aiohttp.ClientError):
if response.status == 429:
return await make_request(request)
return Response(request.url, response.status, error=True)
semaphore = asyncio.Semaphore(3)
curr_loop = asyncio.get_running_loop()
async with aiohttp.ClientSession(headers=headers) as session:
return await asyncio.gather(*[curr_loop.create_task(make_request(request)) for request in requests])
if __name__ == "__main__":
HEADERS = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
REQUESTS = [
Request("https://www.google.com/search?q=query1"),
Request("https://www.google.com/search?q=query2"),
Request("https://www.google.com/search?q=query3"),
Request("https://www.google.com/search?q=query4"),
Request("https://www.google.com/search?q=query5"),
]
loop = asyncio.get_event_loop()
responses = loop.run_until_complete(make_requests(HEADERS, REQUESTS))
print(responses) # [<__main__.Response object at 0x7f4f73e5da30>, <__main__.Response object at 0x7f4f73e734f0>, <__main__.Response object at 0x7f4f73e73790>, <__main__.Response object at 0x7f4f73e5d9d0>, <__main__.Response object at 0x7f4f73e73490>]
loop.close()
If you get 400s after some number of requests, you need to check which headers the browser sends that are missing from your request.
I am trying to understand why, when I receive 200s (which require some JSON serialization and likely add some latency), I am still able to hit rate limits if I am always awaiting a sleep call of 1.5 seconds per call. Is this the fault of the API I am hitting, or is there something intrinsically wrong with my async/await calls?
I'm not sure what you meant by "to be able to hit rate limits", but asyncio.sleep should work properly. The script makes the first limited batch of concurrent requests (in this case, the semaphore allows three concurrent tasks) almost at the same time. After a response is received, it waits 1.5 sec concurrently and returns the result of the task. The key is concurrency: if you wait with asyncio.sleep for 1.5 sec in 3 different tasks, it will wait 1.5 sec, not 4.5. If you want to set delays between requests, you could wait before or after calling create_task, as in the sketch below.
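For illustration, a minimal sketch of that idea (not taken from the code above; the 0.25 s spacing is just an assumed value for a budget of roughly 4 requests per second):
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> int:
    # One request; the caller controls how quickly these are started.
    async with session.get(url) as response:
        await response.read()
        return response.status

async def main(urls: "list[str]", spacing: float = 0.25) -> "list[int]":
    # spacing=0.25 starts at most ~4 requests per second.
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(asyncio.create_task(fetch(session, url)))
            await asyncio.sleep(spacing)  # delay between task creations, not inside the task
        return await asyncio.gather(*tasks)

# statuses = asyncio.run(main(["https://example.com"] * 8))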
I'm trying to create a simple program to check proxies, and about 99% of them time out. This is 100% NOT an issue with the proxies, and I can't seem to figure it out. Here is the code:
import aiohttp,asyncio,requests,collections
import random
proxies = ['http://' +proxy for proxy in requests.get('https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt').text.split('\n')] #large proxy list
#proxies = ['http://' +proxy for proxy in requests.get('https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt').text.split('\n')] #small proxy list
random.shuffle(proxies)
statuses = collections.Counter()
async def make_request(session,proxies_queue):
url = 'https://google.com'
try:
proxy = proxies_queue.get_nowait()
async with session.get(url,proxy = proxy) as resp:
statuses[resp.status]+=1
print('status')
except Exception as e:
print(f'Exception: {e}')
statuses['exception']+=1
async def make_requests(n,delay):
proxies_queue = asyncio.Queue()
for proxy in proxies:
proxies_queue.put_nowait(proxy)
tasks = []
timeout = aiohttp.ClientTimeout(total=10)
async with aiohttp.ClientSession(timeout=timeout) as session:
for i in range(n):
tasks.append(asyncio.create_task(make_request(session,proxies_queue)))
await asyncio.sleep(delay)
for task in asyncio.as_completed(tasks):
await task
asyncio.run(make_requests(100,0.2))
I've had random runs of the program produce 100% 200 status codes, so that's why I'm not trusting anyone who tells me it's the proxies. Also, I just checked the smaller proxy list the moment it was updated, and it produced the same result.
I'm working on a project that parses data from a lot of websites. Most of my code is done, so I'm looking forward to using asyncio in order to eliminate that I/O waiting, but I still wanted to test how threading would do, better or worse. To do that, I wrote some simple code to make requests to 100 websites. By the way, I'm using the requests_html library for that; fortunately, it supports asynchronous requests as well.
The asyncio code looks like this:
import requests
import time
from requests_html import AsyncHTMLSession
aio_session = AsyncHTMLSession()
urls = [...] # 100 urls
async def fetch(url):
try:
response = await aio_session.get(url, timeout=5)
status = 200
except requests.exceptions.ConnectionError:
status = 404
except requests.exceptions.ReadTimeout:
status = 408
if status == 200:
return {
'url': url,
'status': status,
'html': response.html
}
return {
'url': url,
'status': status
}
def extract_html(urls):
tasks = []
for url in urls:
tasks.append(lambda url=url: fetch(url))
websites = aio_session.run(*tasks)
return websites
if __name__ == "__main__":
start_time = time.time()
websites = extract_html(urls)
print(time.time() - start_time)
Execution time (multiple tests):
13.466366291046143
14.279950618743896
12.980706453323364
But if I run an example with threading:
from queue import Queue
import requests
from requests_html import HTMLSession
from threading import Thread
import time
num_fetch_threads = 50
enclosure_queue = Queue()
html_session = HTMLSession()
urls = [...] # 100 urls
def fetch(i, q):
while True:
url = q.get()
try:
response = html_session.get(url, timeout=5)
status = 200
except requests.exceptions.ConnectionError:
status = 404
except requests.exceptions.ReadTimeout:
status = 408
q.task_done()
if __name__ == "__main__":
for i in range(num_fetch_threads):
worker = Thread(target=fetch, args=(i, enclosure_queue,))
worker.setDaemon(True)
worker.start()
start_time = time.time()
for url in urls:
enclosure_queue.put(url)
enclosure_queue.join()
print(time.time() - start_time)
Execution time (multiple tests):
7.476433515548706
6.786043643951416
6.717151403427124
The thing I don't understand: both approaches are meant for I/O-bound problems, so why are threads faster? The more I increase the number of threads, the more resources it uses, but it's a lot faster. Can someone please explain why threads are faster than asyncio in my example?
Thanks in advance.
It turns out requests-html uses a pool of threads for running the requests. The default number of threads is the number of cores on the machine multiplied by 5. This probably explains the difference in performance you noticed.
You might want to try the experiment again using aiohttp instead. In the case of aiohttp, the underlying socket for the HTTP connection is actually registered in the asyncio event loop, so no threads should be involved here.
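For reference, a minimal sketch of what roughly the same experiment could look like with aiohttp (the URL list is a placeholder and the status mapping mirrors the original fetch; this is an illustration, not a benchmark of the exact same pages):
import asyncio
import time
import aiohttp

urls = ["https://python.org"] * 100  # placeholder for the question's 100 urls

async def fetch(session: aiohttp.ClientSession, url: str) -> dict:
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            html = await response.text()
            return {'url': url, 'status': response.status, 'html': html}
    except asyncio.TimeoutError:
        return {'url': url, 'status': 408}
    except aiohttp.ClientConnectionError:
        return {'url': url, 'status': 404}

async def extract_html(urls):
    # A single shared session; the sockets are driven directly by the event loop.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

if __name__ == "__main__":
    start_time = time.time()
    websites = asyncio.run(extract_html(urls))
    print(time.time() - start_time)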