I'm scraping data using asyncio and storing the data in a Redis database. My scraper is running fine, but memory utilization on the Linux server keeps increasing until it reaches 100%, and then the server freezes. I have to manually reboot the server and restart the script. I'm using 2 credentials to hit an API endpoint to get data as fast as possible.
Here is the sample code:
from asyncio import tasks
from datetime import datetime, timedelta
from multiprocessing import Semaphore
from socket import timeout
import time
import asyncio
from aiohttp import ClientSession
from requests.exceptions import HTTPError
import config
import json
import pandas as pd
from loguru import logger
import pytz
import aioredis
from redis import Redis
RESULTS = []
result_dict = {}
redis = Redis(
    host="host",
    port=6379,
    decode_responses=True,
    # ssl=True,
    username="default",
    password="password",
)
async def get(url, session):
    try:
        response = await session.request(method="GET", url=url, timeout=1)
    except Exception as err:
        response = await session.request(method="GET", url=url, timeout=3)
    pokemon = await response.json()
    return pokemon["name"]
async def run_program(url, session, semaphore):
    async with semaphore:
        try:
            pokemon_name = await get(url, session)
            await publish(pokemon_name)
        except:
            pass
async def main():
    header_dict = {
        "header1": {
            # Request headers
            # "API-Key-1": config.PRIMARY_API_KEY,
            "Cache-Control": "no-cache",
        },
        "header2": {
            # "API-Key-2": config.SECONDARY_API_KEY,
            "Cache-Control": "no-cache",
        },
    }
    semaphore = asyncio.BoundedSemaphore(20)
    tasks = []
    for key, value in header_dict.items():
        # logger.info(value)
        async with ClientSession(headers=value) as session:
            for i in range(0, 5):
                URLS = f"https://pokeapi.co/api/v2/pokemon/{i}"
                tasks.append(
                    asyncio.ensure_future(run_program(URLS, session, semaphore))
                )
            await asyncio.gather(*tasks)
async def publish(data):
    if not data.empty:
        try:
            keyName = "channelName"
            value = data
            redis.set(keyName, value)
            print("inserting")
        except:
            pass
    else:
        pass
while True:
    try:
        asyncio.run(main(), debug=True)
    except Exception as e:
        time.sleep(1)
        asyncio.run(main(), debug=True)
I want to know why memory consumption is increasing and how to stop it.
Here is the image of memory utilization in percent over time. There is no other script running on the same Linux server except this one.
There are several possible causes of the memory leak.
You're connecting to Redis and never closing the connection.
Setting timeout=1 means requests will very likely raise exceptions, which can be the main cause of the memory leak (see: Python not catching MemoryError).
A new session is created on every iteration over the headers. In the example there are only two, but the real headers list may be larger.
The tasks list is never emptied after gather is called, so it keeps growing.
I tried to optimize the code and here is what I got.
import asyncio
import time

from aiohttp import ClientSession
from redis import DataError
from redis import Redis


async def publish(data, redis):
    if data:  # data is a plain string (or None), so a truthiness check is enough
        try:
            redis.set("channelName", data)
        except (DataError, Exception):
            pass


async def run_program(url, session, headers, semaphore, redis):
    async with semaphore:
        try:
            response = await session.request(method="GET", url=url, headers=headers)
            pokemon = await response.json()
            pokemon_name = pokemon.get("name")
            await publish(pokemon_name, redis)
        except:
            pass


async def main():
    header_dict = {
        "header1": {
            # Request headers
            "Cache-Control": "no-cache",
        },
        "header2": {
            "Cache-Control": "no-cache",
        },
    }
    semaphore = asyncio.BoundedSemaphore(20)
    # one ClientSession for the whole run instead of one per header
    async with ClientSession() as session:
        for headers in header_dict.values():
            # the context manager closes the Redis connection when the batch is done
            with Redis(host="host", port=6379, decode_responses=True, username="default", password="password") as redis:
                # tasks are created and awaited per batch, so no list accumulates between iterations
                await asyncio.gather(*[
                    asyncio.ensure_future(
                        run_program(f"https://pokeapi.co/api/v2/pokemon/{i}", session, headers, semaphore, redis)
                    ) for i in range(5)
                ])


while True:
    try:
        asyncio.run(main(), debug=True)
    except Exception as e:
        time.sleep(1)
        asyncio.run(main(), debug=True)
All these changes should reduce memory usage.
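If memory still climbs after these changes, the standard library's tracemalloc module can show which lines are accumulating allocations. A small diagnostic sketch, not part of the script above:

import tracemalloc

tracemalloc.start()

# ... let the scraper run for a few iterations ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # the ten biggest allocation sites, by total size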
I'm trying to create a simple program to check proxies, and about 99% of them time out. This is 100% NOT an issue with the proxies, and I can't seem to figure it out. Here is the code:
import aiohttp, asyncio, requests, collections
import random

proxies = ['http://' + proxy for proxy in requests.get('https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt').text.split('\n')]  # large proxy list
# proxies = ['http://' + proxy for proxy in requests.get('https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt').text.split('\n')]  # small proxy list
random.shuffle(proxies)

statuses = collections.Counter()

async def make_request(session, proxies_queue):
    url = 'https://google.com'
    try:
        proxy = proxies_queue.get_nowait()
        async with session.get(url, proxy=proxy) as resp:
            statuses[resp.status] += 1
            print('status')
    except Exception as e:
        print(f'Exception: {e}')
        statuses['exception'] += 1

async def make_requests(n, delay):
    proxies_queue = asyncio.Queue()
    for proxy in proxies:
        proxies_queue.put_nowait(proxy)
    tasks = []
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        for i in range(n):
            tasks.append(asyncio.create_task(make_request(session, proxies_queue)))
            await asyncio.sleep(delay)
        for task in asyncio.as_completed(tasks):
            await task

asyncio.run(make_requests(100, 0.2))
I've had random runs of the program produce 100% 200 status codes, so that's why I'm not trusting anyone who tells me it's the proxies. Also, I just checked the smaller proxy list the moment it was updated, and it produced the same result.
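One way to narrow this down (a sketch, not from the original post) is to count failures by exception type instead of lumping them together, so genuine timeouts can be told apart from connection or proxy errors:

import asyncio
import collections

import aiohttp

failures = collections.Counter()

async def probe(session, proxy):
    try:
        async with session.get('https://google.com', proxy=proxy) as resp:
            failures[f'status {resp.status}'] += 1
    except Exception as e:
        # e.g. TimeoutError vs. ClientProxyConnectionError vs. ServerDisconnectedError
        failures[type(e).__name__] += 1

async def probe_all(proxies):
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        await asyncio.gather(*(probe(session, p) for p in proxies))
    print(failures.most_common())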
The question is quite simple: is it possible to test a list of URLs and store only the dead URLs (response code > 400) in a list, using an asynchronous function?
I previously used the requests library to do it and it works great, but I have a big list of URLs to test and doing it sequentially takes more than 1 hour.
I saw a lot of articles on how to make parallel requests using asyncio and aiohttp, but I didn't see much about how to test URLs with these libraries.
Is it possible to do it?
Using multithreading you could do it like this:
import requests
from concurrent.futures import ThreadPoolExecutor

results = dict()

# test the given url
# add url and status code to the results dictionary if GET succeeds but status code >= 400
# also add url to results dictionary if an exception arises with full exception details
def test_url(url):
    try:
        r = requests.get(url)
        if r.status_code >= 400:
            results[url] = f'{r.status_code=}'
    except requests.exceptions.RequestException as e:
        results[url] = str(e)

# return a list of URLs to be checked. probably get these from a file in reality
def get_list_of_urls():
    return ['https://facebook.com', 'https://google.com', 'http://google.com/nonsense', 'http://goooglyeyes.org']

def main():
    with ThreadPoolExecutor() as executor:
        executor.map(test_url, get_list_of_urls())
    print(results)

if __name__ == '__main__':
    main()
You could do something like this using aiohttp and asyncio.
It could be done more Pythonically, I guess, but this should work.
import aiohttp
import asyncio

urls = ['url1', 'url2']

async def test_url(session, url):
    async with session.get(url) as resp:
        if resp.status > 400:
            return url

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(asyncio.ensure_future(test_url(session, url)))
        dead_urls = await asyncio.gather(*tasks)
        print(dead_urls)
asyncio.run(main())
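Note that gather returns None for every URL that passes the check, so the result may need filtering. A small addition to the answer above:

dead_urls = [url for url in await asyncio.gather(*tasks) if url is not None]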
Very basic example, but this is how I would solve it:
from aiohttp import ClientSession
from asyncio import create_task, gather, run

async def TestUrl(url, session):
    async with session.get(url) as response:
        if response.status >= 400:
            r = await response.text()
            print(f"Site: {url} is dead, response code: {str(response.status)} response text: {r}")

async def TestUrls(urls):
    resultsList: list = []
    async with ClientSession() as session:
        # Maybe some rate limiting?
        partitionTasks: list = [
            create_task(TestUrl(url, session))
            for url in urls]
        resultsList.append(await gather(*partitionTasks, return_exceptions=False))
    # do stuff with the results or return?
    return resultsList

async def main():
    urls = []
    test = await TestUrls(urls)

if __name__ == "__main__":
    run(main())
Try using a ThreadPoolExecutor
from concurrent.futures import ThreadPoolExecutor
import requests

url_list = [
    "https://www.google.com",
    "https://www.adsadasdad.com",
    "https://www.14fsdfsff.com",
    "https://www.ggr723tg.com",
    "https://www.yyyyyyyyyyyyyyy.com",
    "https://www.78sdf8sf5sf45sf.com",
    "https://www.wikipedia.com",
    "https://www.464dfgdfg235345.com",
    "https://www.tttllldjfh.com",
    "https://www.qqqqqqqqqq456.com"
]

def check(url):
    r = requests.get(url)
    if r.status_code < 400:
        print(f"{url} is ALIVE")

with ThreadPoolExecutor(max_workers=5) as e:
    for url in url_list:
        e.submit(check, url)
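To collect the dead URLs into a list, as the question asks, the worker could return the URL on failure instead of printing the live ones. A small variation on the answer above (the timeout value is an arbitrary choice):

def check(url):
    try:
        r = requests.get(url, timeout=10)
        return url if r.status_code >= 400 else None
    except requests.exceptions.RequestException:
        return url  # unreachable hosts count as dead too

with ThreadPoolExecutor(max_workers=5) as e:
    dead_urls = [url for url in e.map(check, url_list) if url]
print(dead_urls)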
Multiprocessing could be the better option for your problem.
from multiprocessing import Process
from multiprocessing import Manager
import requests

def checkURLStatus(url, url_status):
    res = requests.get(url)
    if res.status_code >= 400:
        url_status[url] = "Inactive"
    else:
        url_status[url] = "Active"

if __name__ == "__main__":
    urls = [
        "https://www.google.com"
    ]
    manager = Manager()
    # to store the results for later usage
    url_status = manager.dict()
    procs = []
    for url in urls:
        proc = Process(target=checkURLStatus, args=(url, url_status))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
    print(url_status.values())
url_status is a shared variable that stores data across the separate processes. Refer to this page for more info.
I'm in a situation where I need to retry an async request even when the request returns a 200 response. For some specific cases, I need to check if there's a certain key in the output; if so, we need to retry. The following sample code (which can be executed in a Jupyter notebook) works for retries whenever the request fails (non-200). How can I tweak this to cater to this particular need?
P.S. Ideally, the response would've been non-200, but this is the option I'm left with.
# load required libraries
import json
import asyncio
import aiohttp
from async_retrying import retry

base_url = "http://localhost:1050/hello?rid="

# async ginger call
@retry(attempts=3)
async def async_ginger_call():
    connector = aiohttp.TCPConnector(limit=3)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.post(url, raise_for_status=True, timeout=300) as response:
            result = await response.text()
            # condition here; if key in result then retry
            return json.loads(result)

reqs = 2
tasks = []
connector = aiohttp.TCPConnector(limit=reqs)
async with aiohttp.ClientSession(connector=connector) as session:
    for i in range(reqs):
        url = base_url + str(i)
        # encode sentence
        tasks.append(async_ginger_call())

results = await asyncio.gather(*tasks, return_exceptions=True)
Sample flask server code
# sample api
import time
import json
import datetime
from flask import Flask, request
from flask import Response

app = Flask(__name__)

@app.route('/hello', methods=['GET', 'POST'])
def welcome():
    rid = request.args.get('rid', default=3, type=int)
    valid_response = json.dumps({
        "Result": {
            "Warnings": [
                {
                    "Code": 1100,
                    "Message": "A technical error occurred during execution."
                }
            ]
        }
    })
    # testing for failure
    if rid == 2:
        # pass
        # return valid_response
        return Response("{'Result': ''}", status=500, mimetype='application/json')
    return valid_response

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=1050)
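One way to express the retry-on-key condition described above (a sketch using a plain retry loop instead of the async_retrying decorator; the "Warnings" key and the attempt/delay values are assumptions based on the sample server):

import asyncio
import json

import aiohttp

async def call_with_retry(session, url, attempts=3, delay=1):
    payload = None
    for _ in range(attempts):
        async with session.post(url, raise_for_status=True, timeout=300) as response:
            payload = json.loads(await response.text())
        # 200 response, but the marker key is present: retry after a short pause
        if "Warnings" not in payload.get("Result", {}):
            return payload
        await asyncio.sleep(delay)
    return payload  # give up after the last attempt and return the last payload

It could then be used in place of async_ginger_call inside the gather loop shown above.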
I need to return a response from my FastAPI path operation, but before this I want to send a slow request, and I don't need to wait for the result of that request, just log errors if there are any. Can I do this with plain Python and FastAPI? I would prefer not to add Celery to the project.
Here is what I have so far, but it runs synchronously:
import asyncio
import requests

async def slow_request(data):
    url = 'https://external.service'
    response = requests.post(
        url=url,
        json=data,
        headers={'Auth-Header': settings.API_TOKEN}
    )
    if not response.status_code == 200:
        logger.error('response:', response.status_code)
        logger.error('data', data)

@router.post('/order/')
async def handle_order(order: Order):
    json_data = {
        'order': order
    }
    task = asyncio.create_task(
        slow_request(json_data)
    )
    await task
    return {'body': {'message': 'success'}}
OK, since nobody wants to post an answer, here are the solutions:
Solution #1
We can just remove the await task line, as alex_noname suggested. It works because create_task schedules the task and we no longer wait for its completion.
@router.post('/order/')
async def handle_order(order: Order):
    json_data = {
        'order': order
    }
    task = asyncio.create_task(
        slow_request(json_data)
    )
    return {'body': {'message': 'success'}}
Solution #2
I ended up with BackgroundTasks, as HTF suggested, since I'm already using FastAPI anyway, and this solution seems neater to me.
@router.post('/order/')
async def handle_order(order: Order, background_tasks: BackgroundTasks):
    json_data = {
        'order': order
    }
    background_tasks.add_task(slow_request, json_data)
    return {'body': {'message': 'success'}}
This even works without async before def slow_request(data):
The problem is really two-part:
the requests library is synchronous, so requests.post(...) will block the event loop until completed
you don't need the result of the web request to respond to the client, but your current handler cannot respond to the client until the request is completed (even if it were async)
Consider separating the request logic off into another process, so it can happen at its own pace.
The key is that you can put work into a queue of some kind to be completed eventually, without needing the result in order to respond to the client.
You could use an async HTTP request library and some collection of callbacks, use multiprocessing to spawn a new process (or processes), or something more exotic like an independent program (perhaps with a pipe or sockets to communicate).
Maybe something of this form will work for you
import base64
import json
import multiprocessing

import requests

URL_EXTERNAL_SERVICE = "https://example.com"
TIMEOUT_REQUESTS = (2, 10)  # always set a timeout for requests

SHARED_QUEUE = multiprocessing.Queue()  # may leak as unbounded

async def slow_request(data):
    SHARED_QUEUE.put(data)
    # now returns on successful queue put, rather than request completion

def requesting_loop(logger, Q, url, token):
    while True:  # expects to be a daemon
        data = json.dumps(Q.get())  # block until retrieval (non-daemon can use sentinel here)
        response = requests.post(
            url,
            json=data,
            headers={'Auth-Header': token},
            timeout=TIMEOUT_REQUESTS,
        )
        # raise_for_status() --> try/except --> log + continue
        if response.status_code != 200:
            logger.error('call to {} failed (code={}) with data: {}'.format(
                url, response.status_code,
                "base64:" + base64.b64encode(data.encode())
            ))

def startup():  # run me when starting
    # do whatever is needed for logger
    # create a pool instead if you may need to process a lot of requests
    p = multiprocessing.Process(
        target=requesting_loop,
        kwargs={"logger": logger, "Q": SHARED_QUEUE, "url": URL_EXTERNAL_SERVICE, "token": settings.API_TOKEN},
        daemon=True
    )
    p.start()
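Alternatively (a sketch not in the original answer, assuming an async HTTP client such as aiohttp is acceptable), making slow_request itself non-blocking avoids the extra process and works directly with the create_task / BackgroundTasks solutions above:

import aiohttp

async def slow_request(data):
    timeout = aiohttp.ClientTimeout(total=30)  # assumed timeout value
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(
            'https://external.service',
            json=data,
            headers={'Auth-Header': settings.API_TOKEN},
        ) as response:
            if response.status != 200:
                logger.error('response: %s', response.status)
                logger.error('data: %s', data)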
Hi all! I need to make about 10,000 requests to the web service, and I expect JSON in response. Since the requests are independent of each other, I want to run them in parallel. I think aiohttp can help me with that. I wrote the following code:
import asyncio
import aiohttp

async def execute_module(session: aiohttp.ClientSession, module_id: str,
                         post_body: dict) -> dict:
    headers = {
        'Content-Type': r'application/json',
        'Authorization': fr'Bearer {TOKEN}',
    }
    async with session.post(
        fr'{URL}/{module_id}/steps/execute',
        headers=headers,
        json=post_body,
    ) as response:
        return await response.json()

async def execute_all(campaign_ids, post_body):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[
            execute_module(session, campaign_id, post_body)
            for campaign_id in campaign_ids
        ])

campaign_ids = ['101', '102', '103'] * 400
post_body = {'inputs': [{"name": "one", "value": 1}]}
print(asyncio.run(execute_all(campaign_ids, post_body)))
P.S. I make 1,200 requests for testing.
Another way to solve it is to wrap requests.post in a run_in_executor call. I know it's wrong to use blocking code in an asynchronous function, but it works faster (~7 seconds vs. ~10 seconds for aiohttp):
import requests
import asyncio

def execute_module(module_id, post_body):
    headers = {
        'Content-Type': r'application/json',
        'Authorization': fr'Bearer {TOKEN}',
    }
    return requests.post(
        fr'{URL}/{module_id}/steps/execute',
        headers=headers,
        json=post_body,
    ).json()

async def execute_all(campaign_ids, post_body):
    loop = asyncio.get_running_loop()
    return await asyncio.gather(*[
        loop.run_in_executor(None, execute_module, campaign_id, post_body)
        for campaign_id in campaign_ids
    ])

campaign_ids = ['101', '102', '103'] * 400
post_body = {'inputs': [{"name": "one", "value": 1}]}
print(asyncio.run(execute_all(campaign_ids, post_body)))
What am I doing wrong?
Have you tried uvloop (https://github.com/MagicStack/uvloop)? It should increase the speed of the aiohttp requests.
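A minimal sketch of dropping uvloop in (assuming it has been installed with pip install uvloop):

import asyncio

import uvloop

uvloop.install()  # make asyncio use uvloop's event loop from here on
print(asyncio.run(execute_all(campaign_ids, post_body)))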
loop.run_in_executor(None, ...) runs synchronous code in a thread pool (multiple threads), whereas the event loop runs code in one thread.
My guess is that waiting for IO shouldn't make much of a difference, but handling the response (i.e. json decoding) does.
It's probably because the def execute_module calls don't share a requests.Session, i.e. each call has its own connection pool:
https://github.com/psf/requests/blob/main/requests/sessions.py#L831
https://github.com/psf/requests/blob/main/requests/adapters.py#L138
On the other hand, async def execute_module runs with a shared aiohttp.ClientSession, which has a default limit of 100 connections: https://docs.aiohttp.org/en/latest/http_request_lifecycle.html#how-to-use-the-clientsession
To check that, you can pass a customised aiohttp.TCPConnector with a larger limit to aiohttp.ClientSession, as sketched after the links below:
https://docs.aiohttp.org/en/latest/http_request_lifecycle.html#how-to-use-the-clientsession
https://docs.aiohttp.org/en/latest/client_advanced.html#limiting-connection-pool-size
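A minimal sketch of raising the limit (500 is an arbitrary example value):

import asyncio

import aiohttp

async def execute_all(campaign_ids, post_body):
    connector = aiohttp.TCPConnector(limit=500)  # default limit is 100
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*[
            execute_module(session, campaign_id, post_body)
            for campaign_id in campaign_ids
        ])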