Slow EC2 Performance with Python Threading? - python

I'm using Python threading in a REST endpoint so that the endpoint can launch a thread, and then immediately return a 200 OK to the client while the thread runs. (The client then polls server state to track the progress of the thread).
The code runs in 7 seconds on my local dev system, but takes 6 minutes on an AWS EC2 m5.large.
Here's what the code looks like:
import threading
[.....]
# USES THREADING
# https://stackoverflow.com/a/1239108/364966
thr = threading.Thread(target=score, args=(myArgs1, myArgs2), kwargs={})
thr.start() # Runs score(myArgs1, myArgs2) in a background thread
thr.is_alive() # Will return whether function is running currently
data = {'now creating test scores'}
return Response(data, status=status.HTTP_200_OK)
I turned off threading to test if that was the cause of the slowdown, like this:
# USES THREADING
# https://stackoverflow.com/a/1239108/364966
# thr = threading.Thread(target=score, args=(myArgs1, myArgs2), kwargs={})
# thr.start() # Will run "foo"
# thr.is_alive() # Will return whether function is running currently
# FOR DEBUGGING - SKIP THREADING TO SEE IF THAT'S WHAT'S SLOWING THINGS DOWN ON EC2
score(myArgs1, myArgs2)
data = {'now creating test scores'}
return Response(data, status=status.HTTP_200_OK)
...and it ran in 5 seconds on EC2. This proves that something about how I'm handling threads on EC2 is the cause of the slowdown.
Is there something I need to configure on EC2 to better support Python threads?

An AWS-certified consultant has advised me that EC2 is known to be slow in execution of Python threads, and to use AWS Lambda functions instead.

Related

What is a lifecycle of Aiohttp application when used together with Gunicorn?

A project I'm working on uses Gunicorn and Aiohttp to implement a web server. It all starts with something like this:
# main.py
class GunicornApp(gunicorn.app.base.Application):
    def __init__(self, ...):
        ...

    def load_config(self):
        ...

    def load(self):
        return create_aiohttp_app(...)

if __name__ == "__main__":
    GunicornApp(...).run()
where create_aiohttp_app is defined as something like this:
def create_aiohttp_app(...) -> web.Application:
    app = web.Application(...)
    app.router.add_get(...)
    app.on_startup.append(start_app)
    app.on_cleanup.append(stop_app)
    return app
start_app performs certain initialisation actions and then launches an async task which is supposed to execute indefinitely, thus becoming the server's main payload:
async def start_app(app: web.Application) -> None:
    app["payload_obj"] = PayloadClass(...)
    app["payload_task"] = create_task(app["payload_obj"].run())  # infinite loop inside
stop_app just does some cleanup:
async def stop_app(app: web.Application) -> None:
    app["payload_task"].cancel()
With all of the above, there are a few things that I would like to understand:
How many times is GunicornApp.load() supposed to be called? Is this called once per Gunicorn worker, or is it called once during the whole lifetime of the Gunicorn application? In other words, how many web.Application are expected to be created?
What's the expected lifetime of a web.Application instance returned by create_aiohttp_app? When is it disposed of? Does it live as long as the Gunicorn worker executing it stays alive, or can it outlive it?
How many start_app/stop_app cycles can there be for a web.Application instance? Are these methods only called once each or many times?
What exactly is the relationship between Gunicorn workers and web.Application instances? Does web.Application maintain an infinite event loop inside (thus ensuring that it runs forever and app["payload_task"] doesn't go out of scope), or is there something more complex here?

fastAPI + APScheduler not working asynchronously

I am trying to set up a fastAPI app doing the following:
Accept messages as post requests and put them in a queue;
A background job is, from time to time, pulling messages (up to a certain batch size) from the queue, processing them in a batch, and storing results in a dictionary;
The app is retrieving results from the dictionary and sending them back "as soon as" they are done.
To do so, I've set up a background job with apscheduler communicating via a queue trying to make a simplified version of this post: https://levelup.gitconnected.com/fastapi-how-to-process-incoming-requests-in-batches-b384a1406ec. Here is the code of my app:
import queue
import uuid
from asyncio import sleep

import uvicorn
from pydantic import BaseModel
from fastapi import FastAPI
from apscheduler.schedulers.asyncio import AsyncIOScheduler

app = FastAPI()
app.input_queue = queue.Queue()
app.output_dict = {}
app.queue_limit = 2

def upper_messages():
    # Process up to queue_limit queued messages in one batch
    for i in range(app.queue_limit):
        try:
            obj = app.input_queue.get_nowait()
            app.output_dict[obj['request_id']] = obj['text'].upper()
        except queue.Empty:
            pass

app.scheduler = AsyncIOScheduler()
app.scheduler.add_job(upper_messages, 'interval', seconds=5)
app.scheduler.start()

async def get_result(request_id):
    # Poll the output dictionary until the batch job has stored the result
    while True:
        if request_id in app.output_dict:
            result = app.output_dict[request_id]
            del app.output_dict[request_id]
            return result
        await sleep(0.001)

class Payload(BaseModel):
    text: str

@app.post('/upper')
async def upper(payload: Payload):
    request_id = str(uuid.uuid4())
    app.input_queue.put({'text': payload.text, 'request_id': request_id})
    return await get_result(request_id)

if __name__ == "__main__":
    uvicorn.run(app)
however it's not really running asynchronously; if I invoke the following test script:
from time import time
import requests

texts = [
    'text1',
    'text2',
    'text3',
    'text4'
]
time_start = time()
for text in texts:
    result = requests.post('http://127.0.0.1:8000/upper', json={'text': text})
    print(result.text, time() - time_start)
the messages do get processed, but the whole processing takes 15-20 seconds, the output being something like:
"TEXT1" 2.961090087890625
"TEXT2" 7.96642279624939
"TEXT3" 12.962305784225464
"TEXT4" 17.96261429786682
I was instead expecting the whole processing to take 5-10 seconds (after less than 5 seconds the first two messages should be processed, and the other two more or less exactly 5 seconds later). It seems instead that the second message is not being put to the queue until the first one is processed - i.e. the same as if I were just using a single thread.
Questions:
Does anyone know how to modify the code above so that all the incoming messages are put to the queue immediately upon receiving them?
[bonus question 1]: The above holds true if I run the script (say, debug_app.py) from the command line via uvicorn debug_app:app. But if I run it with python3 debug_app.py no message is returned at all. Messages are received (doing CTRL+C results in Waiting for connections to close. (CTRL+C to force quit)) but never processed.
[bonus question 2]: Another thing I don't understand is why, if I remove the line await sleep(0.001) inside the definition of get_result, the behaviour gets even worse: no matter what I do, the app freezes, I cannot terminate it (i.e. neither CTRL+C nor kill work), I have to send a sigkill (kill -9) to stop it.
Background
If you are wondering why I am doing this, like in the blog post linked above, the purpose is to do efficient deep learning inference. The model I have takes (roughly) the same time to process one request or a dozen requests at once, so batching can dramatically increase throughput. I first tried setting up a fastAPI frontend + RabbitMQ + Flask backend pipeline, and it worked, but the complicated setup (and/or my inability to work with it) added more overhead than the time it took to run the model, nullifying the gain, so I'm first trying to get a minimalistic version to work. The upper_messages function in this toy example will become either a direct invocation of the model (if this computationally heavier step does not block incoming connections too much) or an async call to another process that actually does the computation. I'll see about that later...
... after looking better into it, it looks like the application was indeed working as I wanted it to, my error was in the way I tested it...
Indeed, when sending a POST request to the uvicorn server, the client is left waiting for an answer, which is the intended behaviour. However, this also means that the next request is not sent until the first answer has been collected. So the server is not batching them because there's nothing to batch!
To test this correctly, I slightly altered the test.py script to:
from time import time
import requests
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('prefix')
args = parser.parse_args()
texts = [
    'text1',
    'text2',
    'text3',
    'text4'
]
texts = [args.prefix + '_' + t for t in texts]
time_start = time()
for text in texts:
    result = requests.post('http://127.0.0.1:8000/upper', json={'text': text})
    print(result.text, time() - time_start)
And run it in multiple processes via:
python3 test.py user1 & python3 test.py user2 & python3 test.py user3
The output is now as expected, with pairs of messages (from different users!) being processed in a batch (and the exact order is a bit randomized, although the same user gets, of course, answers in the order of the requests it made):
"USER1_TEXT1" 4.340522766113281
"USER3_TEXT1" 4.340718030929565
"USER2_TEXT1" 9.334393978118896
"USER1_TEXT2" 9.340892553329468
"USER3_TEXT2" 14.33926010131836
"USER2_TEXT2" 14.334421396255493
"USER1_TEXT3" 19.339791774749756
"USER3_TEXT3" 19.33999013900757
"USER1_TEXT4" 24.33989715576172
"USER2_TEXT3" 24.334784030914307
"USER3_TEXT4" 29.338693857192993
"USER2_TEXT4" 29.333901166915894
I'm leaving the question open (and not accepting my own answer) because for the "bonus questions" above (about the application becoming frozen) I still don't have an answer.
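Regarding bonus question 2, a likely explanation (not confirmed in this thread) is that a coroutine which loops without ever reaching an await never hands control back to the event loop, so neither the scheduled job nor the server's shutdown handling gets a chance to run. The sketch below only illustrates the cooperative half of that claim: the polling loop awaits briefly on each pass, which lets a background task publish the result. The names producer, wait_for_result and "req-1" are hypothetical stand-ins for the scheduler job, get_result and a request id.

```python
import asyncio

async def producer(results: dict, key: str) -> None:
    # Stand-in for the scheduled batch job: publishes a result after a short delay.
    await asyncio.sleep(0.05)
    results[key] = "DONE"

async def wait_for_result(results: dict, key: str) -> str:
    # Poll until the key appears. The await inside the loop hands control
    # back to the event loop so that producer() can run. Without it, this
    # loop would spin forever and nothing else could be scheduled.
    while key not in results:
        await asyncio.sleep(0.001)
    return results.pop(key)

async def main() -> str:
    results: dict = {}
    task = asyncio.create_task(producer(results, "req-1"))
    value = await wait_for_result(results, "req-1")
    await task
    return value

outcome = asyncio.run(main())
print(outcome)
```

Removing the await from wait_for_result would make this script hang in the same way, which is why the variant without it is not shown as runnable code.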

Python - Why does aws lambda run multiple threads so slowly?

I am using AWS Lambda to run some Python code:
import threading

def get_likers(link):
    # scrapes a site
    ...

def lambda_handler(event, context):
    text = ...  # gets link from a telegram bot message
    checkt = threading.Thread(target=get_likers, args=[text])
    checkt1 = threading.Thread(target=get_likers, args=["here's a link"])
    checkt2 = threading.Thread(target=get_likers, args=["here's a link"])
    checkt3 = threading.Thread(target=get_likers, args=["here's a link"])
    checkt4 = threading.Thread(target=get_likers, args=["here's a link"])
    checks = []
    checks.append(checkt)
    checks.append(checkt1)
    checks.append(checkt2)
    checks.append(checkt3)
    checks.append(checkt4)
    # Start all threads, then wait for all of them to finish
    for thread in checks:
        thread.start()
    for thread in checks:
        thread.join()
    return {'statusCode': 200}
It should run the threads simultaneously and finish fast. With just 1 thread it takes 3 seconds, but with 5 threads it takes 7 seconds, and with 20 threads 60+ seconds. Why is this happening? Each thread is fairly light, and the data to scrape is the same for each thread.
CPython handles I/O-bound tasks well with threads, but CPU-bound tasks poorly.
And if you add one CPU-bound thread to an otherwise I/O-bound set of threads, they all start having problems.
I don't know the AWS Lambda specifics, but this could be what you're seeing.
Note that Python, the language, threads fine. It's implementations like CPython and PyPy that do not thread well. Jython and IronPython thread well.
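To make the I/O-bound case concrete, here is a minimal sketch, assuming the scraping work is dominated by waiting on the network. That waiting is simulated with time.sleep, and fake_scrape is a hypothetical stand-in for get_likers. It shows only that waits overlap across threads. It does not reproduce the slowdown from the question, which would need CPU-heavy work (for example, parsing large pages) inside each thread.

```python
import threading
import time

def fake_scrape(link, results, index):
    # Hypothetical stand-in for get_likers(): waiting on the network is
    # simulated with time.sleep, which releases the GIL while it waits.
    time.sleep(0.2)
    results[index] = f"scraped {link}"

links = [f"link-{n}" for n in range(5)]
results = [None] * len(links)
threads = [
    threading.Thread(target=fake_scrape, args=(link, results, i))
    for i, link in enumerate(links)
]

start = time.perf_counter()
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
elapsed = time.perf_counter() - start

# The five 0.2-second waits overlap, so the total stays well below the
# 1.0 second that running them one after another would take.
print(f"{len(results)} results in {elapsed:.2f}s")
```

If timings grow roughly linearly with the number of threads instead, that hints that the threads are competing for the CPU rather than waiting on I/O.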

Can Celery pass a Status Update to a non-Blocking Caller?

I am using Celery to asynchronously perform a group of operations. There are a lot of these operations and each may take a long time, so rather than send the results back in the return value of the Celery worker function, I'd like to send them back one at a time as custom state updates. That way the caller can implement a progress bar with a change state callback, and the return value of the worker function can be of constant size rather than linear in the number of operations.
Here is a simple example in which I use the Celery worker function add_pairs_of_numbers to add a list of pairs of numbers, sending back a custom status update for every added pair.
#!/usr/bin/env python
"""
Run worker with:
celery -A tasks worker --loglevel=info
"""
from celery import Celery

app = Celery("tasks", broker="pyamqp://guest@localhost//", backend="rpc://")

@app.task(bind=True)
def add_pairs_of_numbers(self, pairs):
    # Report each sum as a custom state update instead of returning it
    for x, y in pairs:
        self.update_state(state="SUM", meta={"x":x, "y":y, "x+y":x+y})
    return len(pairs)

def handle_message(message):
    if message["status"] == "SUM":
        x = message["result"]["x"]
        y = message["result"]["y"]
        print(f"Message: {x} + {y} = {x+y}")

def non_looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    result = task.get(on_message=handle_message)
    print(result)

def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    print(task)
    while True:
        pass

if __name__ == "__main__":
    import sys
    if sys.argv[1:] and sys.argv[1] == "looping":
        looping((3,4), (2,7), (5,5))
    else:
        non_looping((3,4), (2,7), (5,5))
If you run just ./tasks.py it executes the non_looping function. This does the standard Celery thing: makes a delayed call to the worker function and then uses get to wait for the result. A handle_message callback function prints each message, and the number of pairs added is returned as the result. This is what I want.
$ ./tasks.py
Message: 3 + 4 = 7
Message: 2 + 7 = 9
Message: 5 + 5 = 10
3
Though the non-looping scenario is sufficient for this simple example, the real world task I'm trying to accomplish is processing a batch of files instead of adding pairs of numbers. Furthermore the client is a Flask REST API and therefore cannot contain any blocking get calls. In the script above I simulate this constraint with the looping function. This function starts the asynchronous Celery task, but does not wait for a response. (The infinite while loop that follows simulates the web server continuing to run and handle other requests.)
If you run the script with the argument "looping" it runs this code path. Here it immediately prints the Celery task ID then drops into the infinite loop.
$ ./tasks.py looping
a39c54d3-2946-4f4e-a465-4cc3adc6cbe5
The Celery worker logs show that the add operations are performed, but the caller doesn't define a callback function, so it never gets the results.
(I realize that this particular example is embarrassingly parallel, so I could use chunks to divide this up into multiple tasks. However, in my non-simplified real-world case I have tasks that cannot be parallelized.)
What I want is to be able to specify a callback in the looping scenario. Something like this.
def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs, callback=handle_message)  # There is no such callback.
    print(task)
    while True:
        pass
In the Celery documentation and all the examples I can find online (for example this), there is no way to define a callback function as part of the delay call or its apply_async equivalent. You can only specify one as part of a get callback. That's making me think this is an intentional design decision.
In my REST API scenario I can work around this by having the Celery worker process send a "status update" back to the Flask server in the form of an HTTP post, but this seems weird because I'm starting to replicate messaging logic in HTTP that already exists in Celery.
Is there any way to write my looping scenario so that the caller receives callbacks without making a blocking call, or is that explicitly forbidden in Celery?
This pattern is not supported by Celery, although you can (somewhat) work around it by posting custom state updates from your task, as described here.
Use update_state() to update a task's state:
@app.task(bind=True)
def upload_files(self, filenames):
    for i, file in enumerate(filenames):
        if not self.request.called_directly:
            self.update_state(state='PROGRESS',
                              meta={'current': i, 'total': len(filenames)})
The reason Celery does not support such a pattern is that task producers (callers) are strongly decoupled from task consumers (workers). The only communication channels between the two are the broker, which carries messages from producers to consumers, and the result backend, which carries them from consumers to producers. The closest you can get currently is polling the task state, or writing a custom result backend that lets you post events via AMQP RPC or Redis subscriptions.
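The polling option can be sketched as follows. This is a minimal illustration, not a tested Celery setup: describe_progress is a hypothetical helper, and the fake object stands in for a Celery result so the snippet runs without a broker. It relies only on the state and info attributes that Celery result objects expose.

```python
from types import SimpleNamespace

def describe_progress(result):
    """Return a short progress string for a Celery-style result object."""
    # Custom states set with update_state() show up in result.state,
    # and the accompanying meta dictionary in result.info.
    if result.state == "PROGRESS" and isinstance(result.info, dict):
        current = result.info.get("current", 0)
        total = result.info.get("total", 0)
        return f"{current}/{total}"
    return result.state

# In a Flask view the result would be looked up from the stored task id
# without blocking, for example:
#   result = upload_files.AsyncResult(task_id)
# Here a stand-in object is used instead (assumption for illustration).
fake = SimpleNamespace(state="PROGRESS", info={"current": 2, "total": 5})
print(describe_progress(fake))
```

A status endpoint that the client polls can return this string, so the request handler never calls get and never blocks.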

python Redis Connections

I am using a Redis server with Python.
My application is multithreaded (I use 20-32 threads per process), and I also run the app on different machines.
I have noticed that sometimes Redis CPU usage is 100% and the Redis server becomes unresponsive/slow.
I would like each application to use 1 connection pool of 4 connections in total.
So, for example, if I run my app on at most 20 machines, there should be
20*4 = 80 connections to the Redis server.
from threading import Thread

import redis

POOL = redis.ConnectionPool(max_connections=4, host='192.168.1.1', db=1, port=6379)
R_SERVER = redis.Redis(connection_pool=POOL)

class Worker(Thread):
    def __init__(self):
        super().__init__()  # Thread must be initialised before start()
        self.start()

    def run(self):
        while True:
            key = R_SERVER.randomkey()
            if not key: break
            value = R_SERVER.get(key)

    def _do_something(self, value):
        # do something with value
        pass

if __name__ == '__main__':
    num_threads = 20
    workers = [Worker() for _ in range(num_threads)]
    for w in workers:
        w.join()
The above code should run the 20 threads that get a connection from the connection pool of max size 4 when a command is executed.
When is the connection released?
According to this code (https://github.com/andymccurdy/redis-py/blob/master/redis/client.py):
#### COMMAND EXECUTION AND PROTOCOL PARSING ####
def execute_command(self, *args, **options):
    "Execute a command and return a parsed response"
    pool = self.connection_pool
    command_name = args[0]
    connection = pool.get_connection(command_name, **options)
    try:
        connection.send_command(*args)
        return self.parse_response(connection, command_name, **options)
    except ConnectionError:
        connection.disconnect()
        connection.send_command(*args)
        return self.parse_response(connection, command_name, **options)
    finally:
        pool.release(connection)
After the execution of each command, the connection is released and gets back to the pool
Can someone verify that I have understood the idea correctly and that the above example code will work as described?
Because when I see the redis connections, there are always more than 4.
EDIT: I just noticed in the code that the function has a return statement before the finally. What is the purpose of finally then?
As Matthew Scragg mentioned, the finally clause is executed at the end of the try statement. In this particular case it serves to release the connection back to the pool when finished with it instead of leaving it hanging open.
As to the unresponsiveness, look to what your server is doing. What is the memory limit of your Redis instance? How often are you saving to disk? Are you running on a Xen based VM such as an AWS instance? Are you running replication, and if so how many slaves and are they in a good state or are they frequently calling for a full resync of data? Are any of your commands "save"?
You can answer some of these questions by using the command line interface. For example
redis-cli info persistence will tell you information about the process of saving to disk, redis-cli info memory will tell you about your memory consumption.
When obtaining the persistence information you want to specifically look at rdb_last_bgsave_status and rdb_last_bgsave_time_sec. These will tell you if the last save was successful and how long it took. The longer it takes the higher the chances are you are running into resource issues and the higher the chance you will encounter slowdowns which can appear as unresponsiveness.
The finally block always runs, even though there is a return statement before it. If you look at redis-py/connection.py, pool.release(connection) only puts the connection back into the pool of available connections, so the connection is still alive.
As for the Redis server's CPU usage: your app sends requests continuously, with no breaks or sleeps, so it uses more and more CPU, but not memory. CPU usage has no relation to the number of open files.
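The finally behaviour described above can be checked without a Redis server. In this small sketch, FakePool is a hypothetical stand-in for the connection pool and is not redis-py code. It shows that the release step runs even though the function returns from inside the try block.

```python
class FakePool:
    """Hypothetical stand-in for a connection pool (not redis-py code)."""

    def __init__(self):
        self.released = []

    def get_connection(self):
        return "conn-1"

    def release(self, connection):
        # Record the release so it can be inspected afterwards.
        self.released.append(connection)

def execute_command(pool):
    connection = pool.get_connection()
    try:
        # The return value is computed first...
        return f"reply via {connection}"
    finally:
        # ...then finally runs before the function actually returns.
        pool.release(connection)

pool = FakePool()
reply = execute_command(pool)
print(reply, pool.released)
```

The caller gets the reply and the connection is back in the pool's list, which matches the behaviour of the try/finally in execute_command quoted in the question.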
