How to keep requesting a set of URLs asynchronously (Python)?

I have a set of URLs (same HTTP server but different request parameters). What I want to achieve is to keep requesting all of them asynchronously or in parallel, until I kill the process.
I started by using threading.Thread() to create one thread per URL, with a while True: loop in the requesting function. This was already faster than a single thread making one request at a time, of course, but I would like to achieve better performance.
Then I tried the aiohttp library to run the requests asynchronously. My code looks like this (FYI, each URL is composed of url_base and product.id, and each URL uses a different proxy for the request):
async def fetch(product, i, proxies, session):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
    while True:
        try:
            async with session.get(
                    url_base + product.id,
                    proxy=proxies[i],
                    headers=headers,
                    ssl=False) as response:
                content = await response.read()
                print(content)
        except Exception as e:
            print('ERROR ', str(e))
async def startQuery(proxies):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for i, product in enumerate(hermes_products):
            task = asyncio.ensure_future(fetch(product, i, proxies, session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses

loop = asyncio.get_event_loop()
loop.run_until_complete(startQuery(global_proxy))
My observations are: 1) it is not as fast as I would expect, and actually slower than using threads; 2) more importantly, the requests only return normally at the beginning of the run, and soon almost all of them fail with errors like:
ERROR Cannot connect to host PROXY_IP:PORT ssl:False [Connect call failed ('PROXY_IP', PORT)]
or
ERROR 503, message='Too many open connections'
or
ERROR [Errno 54] Connection reset by peer
Am I doing something wrong here (particularly with the while True loop)? If so, how can I achieve my goal properly?
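For reference, a minimal sketch (not the poster's code) of how the same per-URL while True loop can be throttled so the proxies are not overwhelmed; the connector limit, total timeout, and per-iteration delay are assumed values, while hermes_products, global_proxy, and url_base are taken from the question:
import asyncio
import aiohttp

async def fetch_forever(session, url, proxy, delay=1.0):
    headers = {'User-Agent': 'Mozilla/5.0'}
    while True:
        try:
            async with session.get(url, proxy=proxy, headers=headers, ssl=False) as response:
                content = await response.read()
                print(len(content))
        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            print('ERROR', e)
        await asyncio.sleep(delay)  # small pause so the proxies are not hammered continuously

async def main(products, proxies, url_base):
    timeout = aiohttp.ClientTimeout(total=30)      # assumed value
    connector = aiohttp.TCPConnector(limit=50)     # cap simultaneous connections (assumed value)
    async with aiohttp.ClientSession(timeout=timeout, connector=connector) as session:
        tasks = [asyncio.create_task(fetch_forever(session, url_base + p.id, proxies[i]))
                 for i, p in enumerate(products)]
        await asyncio.gather(*tasks)

# asyncio.run(main(hermes_products, global_proxy, url_base))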

Related

Asyncio Sleep Does not seem to Satisfy Rate Limits for certain status codes

I am working with an API that limits to 4 requests/s. I am using asyncio and aiohttp to make asynchronous http requests. I am using Windows.
When working with the API I most commonly receive three status codes: 200, 400 and 429.
The 4/s limit works entirely fine while I am seeing many 400s, but as soon as I receive a 200 and try to resolve the JSON using response.json, I begin to receive 429s for too many requests. I am trying to understand why something like this would occur.
EDIT: After doing some logging, I am seeing that the 429s appear to creep up after I get a response that takes longer than a second to complete (in the case of a 200, a large JSON response might take a bit of time to resolve). It appears that after a request whose elapsed time is over 1 s is followed by a fast one (which takes a few ms), the requests sort of "jump" ahead too quickly and overwhelm the API with more than 4 requests resolving in a second.
I am utilizing a semaphore of size 3 (with 4 I hit 429s far more often). The workflow is generally:
1. Create the event loop.
2. Gather tasks.
3. Create the HTTP session and begin the async requests with our semaphore.
4. _fetch() handles each individual asynchronous request.
I am trying to understand why this only happens when I receive 200s (which require some JSON deserialization and likely add some latency). If I am always awaiting a sleep call of 1.5 seconds per request, why am I still able to hit rate limits? Is this the fault of the API I am hitting, or is there something intrinsically wrong with my async/await calls?
Below is my code:
import asyncio
import aiohttp
import time

class Request:
    def __init__(self, url: str, method: str = "get", payload: str = None):
        self.url: str = url
        self.method: str = method
        self.payload: str or dict = payload or dict()

class Response:
    def __init__(self, url: str, status: int, payload: dict = None, error: bool = False, text: str = None):
        self.url: str = url
        self.status: int = status
        self.payload: dict = payload or dict()
        self.error: bool = error
        self.text: str = text or ''

def make_requests(headers: dict, requests: "list[Request]") -> "list[Response]":
    """
    requests is a list with data necessary to make requests
    """
    loop: asyncio.AbstractEventLoop = asyncio.get_event_loop()
    responses: "list[Response]" = loop.run_until_complete(_run(headers, requests))
    return responses

async def _run(headers: dict, requests: "list[Request]") -> "list[Response]":
    # Create a semaphore to limit how many concurrent tasks we can run at a time (maximum).
    semaphore: asyncio.Semaphore = asyncio.Semaphore(3)
    time.sleep(10)  # wait 10 seconds before beginning our async requests
    async with aiohttp.ClientSession(headers=headers) as session:
        tasks: "list[asyncio.Task]" = [asyncio.create_task(_iterate(semaphore, session, request)) for request in requests]
        responses: "list[Response]" = await asyncio.gather(*tasks)
        return responses

async def _iterate(semaphore: asyncio.Semaphore, session: aiohttp.ClientSession, request: Request) -> Response:
    async with semaphore:
        return await _fetch(session, request)

async def _fetch(session: aiohttp.ClientSession, request: Request) -> Response:
    try:
        async with session.request(request.method, request.url, params=request.payload) as response:
            print(f"NOW: {time.time()}")
            print(f"Response Status: {response.status}.")
            content: dict = await response.json()
            response.raise_for_status()
            await asyncio.sleep(1.5)
            return Response(request.url, response.status, payload=content, error=False)
    except aiohttp.ClientResponseError:
        if response.status == 429:
            await asyncio.sleep(12)  # Back off before proceeding with more requests
            return await _fetch(session, request)
        else:
            await asyncio.sleep(1.5)
            return Response(request.url, response.status, error=True)
The 4/s limit works entirely fine while I am seeing many 400s, but as soon as I receive a 200 and try to resolve the JSON using response.json, I begin to receive 429s for too many requests. I am trying to understand why something like this would occur.
The response status does not depend on how often you call the .json method on a response. The cause is more likely rate limiting or protection on the server the API is running on. While debugging, I also reworked make_requests to make it more readable.
import asyncio
import aiohttp

class Request:
    def __init__(self, url: str, method: str = "get", payload: str = None):
        self.url: str = url
        self.method: str = method
        self.payload: str or dict = payload or dict()

class Response:
    def __init__(self, url: str, status: int, payload: dict = None, error: bool = False, text: str = None):
        self.url: str = url
        self.status: int = status
        self.payload: dict = payload or dict()
        self.error: bool = error
        self.text: str = text or ''

async def make_requests(headers: dict, requests: "list[Request]"):
    """
    This function makes concurrent requests with a semaphore.
    :param headers: Main HTTP headers to use in the session.
    :param requests: A list of Request objects.
    :return: List of responses converted to Response objects.
    """
    async def make_request(request: Request) -> Response:
        """
        This closure makes a limited number of requests at a time.
        :param request: An instance of Request that describes an HTTP request.
        :return: A processed response.
        """
        async with semaphore:
            try:
                response = await session.request(request.method, request.url, params=request.payload)
                content = await response.json()
                response.raise_for_status()
                return Response(request.url, response.status, payload=content, error=False)
            except (aiohttp.ClientResponseError, aiohttp.ContentTypeError, aiohttp.ClientError):
                if response.status == 429:
                    return await make_request(request)
                return Response(request.url, response.status, error=True)

    semaphore = asyncio.Semaphore(3)
    curr_loop = asyncio.get_running_loop()
    async with aiohttp.ClientSession(headers=headers) as session:
        return await asyncio.gather(*[curr_loop.create_task(make_request(request)) for request in requests])

if __name__ == "__main__":
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0"
    }
    REQUESTS = [
        Request("https://www.google.com/search?q=query1"),
        Request("https://www.google.com/search?q=query2"),
        Request("https://www.google.com/search?q=query3"),
        Request("https://www.google.com/search?q=query4"),
        Request("https://www.google.com/search?q=query5"),
    ]
    loop = asyncio.get_event_loop()
    responses = loop.run_until_complete(make_requests(HEADERS, REQUESTS))
    print(responses)  # [<__main__.Response object at 0x7f4f73e5da30>, <__main__.Response object at 0x7f4f73e734f0>, <__main__.Response object at 0x7f4f73e73790>, <__main__.Response object at 0x7f4f73e5d9d0>, <__main__.Response object at 0x7f4f73e73490>]
    loop.close()
If you start getting 400s after some number of requests, you need to check which headers the browser sends that are missing from your request.
I am trying to understand why this only happens when I receive 200s (which require some JSON deserialization and likely add some latency). If I am always awaiting a sleep call of 1.5 seconds per request, why am I still able to hit rate limits? Is this the fault of the API I am hitting, or is there something intrinsically wrong with my async/await calls?
I'm not sure what you mean by "able to hit rate limits", but asyncio.sleep should work properly. The script makes the first limited batch of concurrent requests (here the semaphore allows three concurrent tasks) almost at the same time. After a response is received, each task waits 1.5 seconds concurrently and then returns its result. The key is concurrency: if you wait with asyncio.sleep for 1.5 seconds in 3 different tasks, the total wait is 1.5 seconds, not 4.5. If you want to set delays between requests, you can wait before or after calling create_task (see the sketch below).
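A minimal sketch of one way to do that (not part of the original answer): sleep while still holding the semaphore, so that at most three requests can start within any 1.5-second window, no matter how long the JSON of a 200 response takes to resolve. The values 3 and 1.5 are assumptions chosen to stay under the stated 4 requests/s limit.
import asyncio
import aiohttp

semaphore = asyncio.Semaphore(3)  # assumed: at most 3 in-flight slots

async def throttled_get(session: aiohttp.ClientSession, url: str) -> dict:
    async with semaphore:
        async with session.get(url) as response:
            payload = await response.json()
        # Sleep while still holding the semaphore slot, so the next task
        # cannot start its request too soon after this one.
        await asyncio.sleep(1.5)
        return payload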

Proxies not working in aiohttp (very weird)

I'm trying to create a simple program to check proxies, and about 99% of them time out. This is 100% NOT an issue with the proxies, and I can't seem to figure it out. Here is the code:
import aiohttp, asyncio, requests, collections
import random

proxies = ['http://' + proxy for proxy in requests.get('https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt').text.split('\n')]  # large proxy list
#proxies = ['http://' + proxy for proxy in requests.get('https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt').text.split('\n')]  # small proxy list
random.shuffle(proxies)
statuses = collections.Counter()

async def make_request(session, proxies_queue):
    url = 'https://google.com'
    try:
        proxy = proxies_queue.get_nowait()
        async with session.get(url, proxy=proxy) as resp:
            statuses[resp.status] += 1
            print('status')
    except Exception as e:
        print(f'Exception: {e}')
        statuses['exception'] += 1

async def make_requests(n, delay):
    proxies_queue = asyncio.Queue()
    for proxy in proxies:
        proxies_queue.put_nowait(proxy)
    tasks = []
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        for i in range(n):
            tasks.append(asyncio.create_task(make_request(session, proxies_queue)))
            await asyncio.sleep(delay)
        for task in asyncio.as_completed(tasks):
            await task

asyncio.run(make_requests(100, 0.2))
I've had random runs of the program produce 100% 200 status codes, so that's why I'm not trusting anyone who tells me it's the proxies. Also, I just checked the smaller proxy list the moment it was updated, and it produced the same result.
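Not an answer from the thread, but a diagnostic sketch that may help narrow this down: counting which exception type fires (timeout vs. proxy connect failure vs. other client errors) usually shows whether a proxy is dead, too slow for the 10-second total timeout, or rejecting the connection. check_proxy is an illustrative name; statuses and proxies_queue are reused from the question's code.
import asyncio
import collections

import aiohttp

statuses = collections.Counter()

async def check_proxy(session, proxies_queue):
    url = 'https://google.com'
    proxy = proxies_queue.get_nowait()
    try:
        async with session.get(url, proxy=proxy) as resp:
            statuses[resp.status] += 1
    except asyncio.TimeoutError:
        statuses['timeout'] += 1               # proxy too slow for the 10 s total timeout
    except aiohttp.ClientProxyConnectionError:
        statuses['proxy connect failed'] += 1  # proxy refused or dropped the connection
    except aiohttp.ClientError as e:
        statuses[type(e).__name__] += 1        # anything else aiohttp raises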

HEAD requests with aiohttp are dog slow

Given a list of 50k website URLs, I've been tasked to find out which of them are up/reachable. The idea is just to send a HEAD request to each URL and look at the response status. From what I hear, an asynchronous approach is the way to go, and for now I'm using asyncio with aiohttp.
I came up with the following code, but the speed is pretty abysmal: 1000 URLs take approximately 200 seconds on my 10 Mbit connection. I don't know what speeds to expect, but I'm new to asynchronous programming in Python, so I figured I've stepped wrong somewhere. As you can see, I've tried increasing the number of allowed simultaneous connections to 1000 (up from the default of 100) and the duration for which DNS resolutions are kept in the cache, neither to any great effect. The environment has Python 3.6 and aiohttp 3.5.4.
Code review unrelated to the question is also appreciated.
import asyncio
import time
from socket import gaierror
from typing import List, Tuple

import aiohttp
from aiohttp.client_exceptions import TooManyRedirects

# Using a non-default user-agent seems to avoid lots of 403 (Forbidden) errors
HEADERS = {
    'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/45.0.2454.101 Safari/537.36'),
}

async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
    try:
        # A HEAD request is quicker than a GET request
        resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
        async with resp:
            status = resp.status
            reason = resp.reason
            if status == 405:
                # HEAD request not allowed, fall back on GET
                resp = await session.get(
                    url, allow_redirects=True, ssl=False, headers=HEADERS)
                async with resp:
                    status = resp.status
                    reason = resp.reason
        return (status, reason)
    except aiohttp.InvalidURL as e:
        return (900, str(e))
    except aiohttp.ClientConnectorError:
        return (901, "Unreachable")
    except gaierror as e:
        return (902, str(e))
    except aiohttp.ServerDisconnectedError as e:
        return (903, str(e))
    except aiohttp.ClientOSError as e:
        return (904, str(e))
    except TooManyRedirects as e:
        return (905, str(e))
    except aiohttp.ClientResponseError as e:
        return (906, str(e))
    except aiohttp.ServerTimeoutError:
        return (907, "Connection timeout")
    except asyncio.TimeoutError:
        return (908, "Connection timeout")

async def get_status_codes(loop: asyncio.events.AbstractEventLoop, urls: List[str],
                           timeout: int) -> List[Tuple[int, str]]:
    conn = aiohttp.TCPConnector(limit=1000, ttl_dns_cache=300)
    client_timeout = aiohttp.ClientTimeout(connect=timeout)
    async with aiohttp.ClientSession(
            loop=loop, timeout=client_timeout, connector=conn) as session:
        codes = await asyncio.gather(*(get_status_code(session, url) for url in urls))
        return codes

def poll_urls(urls: List[str], timeout=20) -> List[Tuple[int, str]]:
    """
    :param timeout: in seconds
    """
    print("Started polling")
    time1 = time.time()
    loop = asyncio.get_event_loop()
    codes = loop.run_until_complete(get_status_codes(loop, urls, timeout))
    time2 = time.time()
    dt = time2 - time1
    print(f"Polled {len(urls)} websites in {dt:.1f} seconds "
          f"at {len(urls)/dt:.3f} URLs/sec")
    return codes
Right now you're launching all your requests at once, so the bottleneck probably appears somewhere along the way. To avoid this situation, a semaphore can be used:
# code
sem = asyncio.Semaphore(200)

async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
    try:
        async with sem:
            resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
            # code
I tested it the following way:
poll_urls([
    'http://httpbin.org/delay/1'
    for _ in range(2000)
])
And got:
Started polling
Polled 2000 websites in 13.2 seconds at 151.300 URLs/sec
Although it requests a single host, it shows that the asynchronous approach does the job: 13 seconds < 2000 seconds.
Several more things can be done:
- Play with the semaphore value to achieve better performance for your concrete environment and task.
- Try lowering the timeout from 20 to, let's say, 5 seconds: since you're just doing a HEAD request it shouldn't take much time, and if a request hangs for 5 seconds there is a good chance it won't succeed at all.
- Monitor your system resources (network/CPU/RAM) while the script is running to find out whether a bottleneck is still present.
- By the way, did you install aiodns (as the docs suggest)?
- Does disabling SSL change anything?
- Try enabling debug-level logging to see if there is any useful info there.
- Try setting up client tracing and, in particular, measure the time of each request step to see which ones take the most time (see the sketch below).
It's difficult to say more without a fully reproducible situation.
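The client-tracing suggestion might look roughly like this (a sketch assuming aiohttp 3.x, not tested against the poster's setup): it prints how long DNS resolution and the whole request took, which helps locate the bottleneck.
import time

import aiohttp

async def on_dns_start(session, ctx, params):
    ctx.dns_start = time.monotonic()

async def on_dns_end(session, ctx, params):
    print(f'DNS for {params.host}: {time.monotonic() - ctx.dns_start:.3f}s')

async def on_request_start(session, ctx, params):
    ctx.request_start = time.monotonic()

async def on_request_end(session, ctx, params):
    print(f'{params.url}: {time.monotonic() - ctx.request_start:.3f}s total')

trace_config = aiohttp.TraceConfig()
trace_config.on_dns_resolvehost_start.append(on_dns_start)
trace_config.on_dns_resolvehost_end.append(on_dns_end)
trace_config.on_request_start.append(on_request_start)
trace_config.on_request_end.append(on_request_end)

# Then pass it to the session created in get_status_codes:
# aiohttp.ClientSession(loop=loop, timeout=client_timeout, connector=conn,
#                       trace_configs=[trace_config])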

Continuously probing multiple URLs with fewer threads, how to control threads

Background:
I want to monitor, say, 100 URLs (take a snapshot, and store it if the content differs from the previous one). My plan is to use urllib.request to scan them every x minutes, say x=5, non-stop.
So I can't use a single for loop and sleep, because I want to kick off the detection for URL1 and then kick off URL2 almost simultaneously.
while True:
    for url in urlList:
        do_detection()
    time.sleep(sleepLength)
Therefore I should be using a pool? But I should limit the threads to a small number that my CPU can handle (I can't use 100 threads if I have 100 URLs).
My question:
Even if I can send the 100 URLs in my list to ThreadPool(4) with four threads, how shall I design it so that each thread handles 100/4 = 25 URLs, i.e. a thread probes URL1, sleeps (300 s) before the next probe of URL1, and meanwhile works through URL2 ... URL25 before going back to URL1? I don't want to wait 5 min * 25 for a full cycle.
Pseudo code or examples would be a great help! I can't find or think of a way to make looper() and detector() behave as needed.
(I think How to scrap multiple html page in parallel with beautifulsoup in python? this is close but not exact answer)
Maybe something like this for each thread? I will try to work out how to split the 100 items across the threads now. Using pool.map(func, iterable[, chunksize]) takes a list, and I can set chunksize to 25.
def one_thread(urls):
    for url in urls[0:25]:
        current_detect(url)
    if 300 - time_elapsed > 0:
        remain_sleeping = 300 - time_elapsed
    else:
        remain_sleeping = 0
    sleep(remain_sleeping)
    for url in urls[0:25]:
        next_detect(url)
The non-working code I am trying to write:
import urllib.request as req
import time

def url_reader(url="http://stackoverflow.com"):
    try:
        f = req.urlopen(url)
        print(f.read())
    except Exception as err:
        print(err)

def save_state():
    pass
    return []

def looper(urlList, sleepLength=720):
    for url in urlList:  # initial save
        Latest_saved.append(save_state(url_reader(url)))  # return a list
    while True:
        pool = ThreadPool(4)
        results = pool.map(req.urlopen, urlList)
        time.sleep(sleepLength)  # how to parallel this? if we have 100 urls, then it takes 100*20 min to loop?
        detector(urlList)  # ? use last saved status returned to compare?

def detector(urlList):
    for url in urlList:
        contentFirst = url_reader(url)
        contentNext = url_reader(url)
        if contentFirst != contentNext:
            save_state(contentFirst)
            save_state(contentNext)
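As a sketch of the chunk-per-thread design the question describes (not the answer below): each worker owns a fixed slice of the URL list, probes it, then sleeps whatever is left of the 300-second interval. do_detection() is the poster's function and is assumed to exist; worker and run are illustrative names.
import time
from multiprocessing.pool import ThreadPool

INTERVAL = 300  # 5 minutes

def worker(url_chunk):
    # Each worker loops over its own URLs forever.
    while True:
        started = time.time()
        for url in url_chunk:
            do_detection(url)  # the poster's detection function, assumed to exist
        elapsed = time.time() - started
        time.sleep(max(0, INTERVAL - elapsed))  # sleep only the remainder of the interval

def run(url_list, n_threads=4):
    # Round-robin split: 100 URLs across 4 threads gives 25 URLs each.
    chunks = [url_list[i::n_threads] for i in range(n_threads)]
    ThreadPool(n_threads).map(worker, chunks)  # blocks forever, since the workers never return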
You need to install requests,
pip install requests
if you want to use the following code:
# -*- coding: utf-8 -*-
import concurrent.futures
import requests
import queue
import threading

# URL Pool
URLS = [
    # Put your urls here
]

# Time interval (in seconds)
INTERVAL = 5 * 60

# The number of worker threads
MAX_WORKERS = 4

# You should set up request headers
# if you want to better evade anti-spider programs
HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    #'Host': None,
    'If-Modified-Since': '0',
    #'Referer': None,
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
}

############################

def handle_response(response):
    # TODO implement your logics here !!!
    raise RuntimeError('Please implement function `handle_response`!')

# Retrieve a single page and report the URL and contents
def load_url(session, url):
    #print('load_url(session, url={})'.format(url))
    response = session.get(url)
    if response.status_code == 200:
        # You can refactor this part and
        # make it run in another thread
        # devoted to handling local IO tasks,
        # to reduce the burden of Net IO worker threads
        return handle_response(response)

def ThreadPoolExecutor():
    return concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS)

# Generate a session object
def Session():
    session = requests.Session()
    session.headers.update(HEADERS)
    return session

# We can use a with statement to ensure threads are cleaned up promptly
with ThreadPoolExecutor() as executor, Session() as session:
    if not URLS:
        raise RuntimeError('Please fill in the array `URLS` to start probing!')
    tasks = queue.Queue()
    for url in URLS:
        tasks.put_nowait(url)

    def wind_up(url):
        #print('wind_up(url={})'.format(url))
        tasks.put(url)

    while True:
        url = tasks.get()
        # Work
        executor.submit(load_url, session, url)
        threading.Timer(interval=INTERVAL, function=wind_up, args=(url,)).start()
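If you replace the placeholder handle_response, one possible (hypothetical) implementation for the snapshot-and-compare use case from the question could look like this; previous_snapshots and save_snapshot are illustrative names, not part of the answer.
previous_snapshots = {}  # hypothetical in-memory store of the last body seen per URL

def save_snapshot(url, content):
    # TODO: persist the changed content to disk or a database as needed
    print('content changed for', url)

def handle_response(response):
    url = str(response.url)
    content = response.text
    if previous_snapshots.get(url) != content:
        save_snapshot(url, content)
        previous_snapshots[url] = content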

Multithread python requests [duplicate]

This question already has answers here:
What is the fastest way to send 100,000 HTTP requests in Python?
(21 answers)
Closed 6 years ago.
For my bachelor thesis I need to grab some data out of about 40000 websites. Therefore I am using Python requests, but at the moment it is really slow at getting a response from the server.
Is there any way to speed it up while keeping my current header setting? All the tutorials I found do it without a header.
Here is my code snippet:
def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get(url, headers=headers)
    for line in r.iter_lines():
        ...
Well, you can use threads, since this is an I/O-bound problem. Using the built-in threading library is your best choice. I used the Semaphore object to limit how many threads can run at the same time.
import time
import threading

# Number of parallel threads
lock = threading.Semaphore(2)

def parse(url):
    """
    Change this to your logic; I just use sleep to mock the http request.
    """
    print('getting info', url)
    time.sleep(2)
    # After we are done, release the semaphore (increment it)
    lock.release()

def parse_pool():
    # List of all your urls
    list_of_urls = ['website1', 'website2', 'website3', 'website4']
    # List of thread objects so we can handle them later
    thread_pool = []
    for url in list_of_urls:
        # Create a new thread that calls your function with a url
        thread = threading.Thread(target=parse, args=(url,))
        thread_pool.append(thread)
        thread.start()
        # Acquire the semaphore (decrement it), so we will wait if needed.
        lock.acquire()
    for thread in thread_pool:
        thread.join()
    print('done')
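A more compact variant of the same thread-based idea (a sketch, not the answerer's code), keeping the original header and letting the pool size do the throttling instead of a semaphore; the worker count and timeout are assumed values.
from concurrent.futures import ThreadPoolExecutor
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def parse(url):
    # Keep the custom header from the question; the timeout is an assumed value.
    r = requests.get(url, headers=HEADERS, timeout=10)
    return url, r.status_code

def parse_pool(urls, workers=20):
    # The pool size caps how many requests run at once.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(parse, urls))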
You can use asyncio to run tasks concurrently. You can list the URL responses (both the completed and the pending ones) using the return value of asyncio.wait() and call the coroutines asynchronously. The results will come back in an unpredictable order, but it is a faster approach.
import asyncio
import functools

async def parse(url):
    print('in parse for url {}'.format(url))
    info = await ...  # write the logic for fetching the info; it waits for the responses from the urls
    print('done with url {}'.format(url))
    return 'parse {} result from {}'.format(info, url)

async def main(sites):
    print('starting main')
    parses = [
        parse(url)
        for url in sites
    ]
    print('waiting for phases to complete')
    completed, pending = await asyncio.wait(parses)
    results = [t.result() for t in completed]
    print('results: {!r}'.format(results))

event_loop = asyncio.get_event_loop()
try:
    websites = ['site1', 'site2', 'site3']
    event_loop.run_until_complete(main(websites))
finally:
    event_loop.close()
I think it's a good idea to use multithreading (threading) or multiprocessing, or you can use grequests (asynchronous requests based on gevent).
