How do I run a function in a single worker, not in all workers, in my Gunicorn app?

I have a Gunicorn Flask app on Docker. There is a load_artifacts function that pulls artifacts from an S3 bucket and saves them to the Docker volume. I only want to run this function once, during the deployment of the app, but I have multiple Gunicorn workers (4) and threads. Looking at the logs, it seems like each of the workers is running this function, but I only want one of them to run this process.
Is there a way I can have multiple workers but only have a single worker run my function during deployment?
I tried running with a single worker, but I need multiple workers for performance reasons.
I tried adding logic that checks whether the S3 file already exists in the volume, but the file is quite big (400 MB), so this doesn't stop other workers from triggering the process.

I'm sure it's not foolproof but I have used a file as a lock for this purpose in the past:
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


class FileLockContextManager:
    """Create a file on enter and remove that file on exit."""

    def __init__(self, filename) -> None:
        self.filename = filename
        self.lock_file = Path(self.filename)

    def __enter__(self) -> None:
        # Might raise FileExistsError
        self.lock_file.touch(exist_ok=False)

    def __exit__(
        self,
        exc_type,
        exc_val,
        exc_tb,
    ) -> None:
        logger.warning("Deleting lock file")
        self.lock_file.unlink()


# Use it like so
try:
    with FileLockContextManager("/my/lock.file"):
        download_from_s3()
except FileExistsError:
    print("Not the first worker to touch the file")
I would be interested to know whether there are any downsides to this or any ways to improve it.

Related

What is a lifecycle of Aiohttp application when used together with Gunicorn?

A project I'm working on uses Gunicorn and Aiohttp to implement a web server. It all starts with something like this:
# main.py
class GunicornApp(gunicorn.app.base.Application):
    def __init__(self, ...):
        ...

    def load_config(self):
        ...

    def load(self):
        return create_aiohttp_app(...)


if __name__ == "__main__":
    GunicornApp(...).run()
where create_aiohttp_app is defined as something like this:
def create_aiohttp_app(...) -> web.Application:
    app = web.Application(...)
    app.router.add_get(...)
    app.on_startup.append(start_app)
    app.on_cleanup.append(stop_app)
    return app
start_app performs certain initialisation actions and then launches an async task which is supposed to execute indefinitely, thus becoming the server's main payload:
async def start_app(app: web.Application) -> None:
    app["payload_obj"] = PayloadClass(...)
    app["payload_task"] = create_task(app["payload_obj"].run())  # infinite loop inside
stop_app just does some cleanup:
async def stop_app(app: web.Application) -> None:
    app["payload_task"].cancel()
With all of the above, there are a few things that I would like to understand:
How many times is GunicornApp.load() supposed to be called? Is it called once per Gunicorn worker, or once during the whole lifetime of the Gunicorn application? In other words, how many web.Application instances are expected to be created?
What's the expected lifetime of a web.Application instance returned by create_aiohttp_app? When is it disposed of? Does it live only as long as the Gunicorn worker executing it stays alive, or can it outlive it?
How many start_app/stop_app cycles can there be for a web.Application instance? Are these methods only called once each or many times?
What exactly is the relationship between Gunicorn workers and web.Application instances? Does web.Application maintain an infinite event loop inside (thus ensuring that it runs forever and app["payload_task"] doesn't go out of scope), or is there something more complex here?

How can I provide shared state to my Flask app with multiple workers without depending on additional software?

I want to provide shared state for a Flask app which runs with multiple workers, i.e., multiple processes.
To quote this answer from a similar question on this topic:
You can't use global variables to hold this sort of data. [...] Use a data source outside of Flask to hold global data. A database, memcached, or redis are all appropriate separate storage areas, depending on your needs.
(Source: Are global variables thread safe in flask? How do I share data between requests?)
My question is on that last part regarding suggestions on how to provide the data "outside" of Flask. Currently, my web app is really small and I'd like to avoid requirements or dependencies on other programs. What options do I have if I don't want to run Redis or anything else in the background but provide everything with the Python code of the web app?
If your webserver's worker type is compatible with the multiprocessing module, you can use multiprocessing.managers.BaseManager to provide a shared state for Python objects. A simple wrapper could look like this:
from multiprocessing import Lock
from multiprocessing.managers import AcquirerProxy, BaseManager, DictProxy


def get_shared_state(host, port, key):
    shared_dict = {}
    shared_lock = Lock()
    manager = BaseManager((host, port), key)
    manager.register("get_dict", lambda: shared_dict, DictProxy)
    manager.register("get_lock", lambda: shared_lock, AcquirerProxy)
    try:
        manager.get_server()
        manager.start()
    except OSError:  # Address already in use
        manager.connect()
    return manager.get_dict(), manager.get_lock()
You can assign your data to the shared_dict to make it accessible across processes:
HOST = "127.0.0.1"
PORT = 35791
KEY = b"secret"
shared_dict, shared_lock = get_shared_state(HOST, PORT, KEY)
shared_dict["number"] = 0
shared_dict["text"] = "Hello World"
shared_dict["array"] = numpy.array([1, 2, 3])
However, you should be aware of the following circumstances:
Use shared_lock to protect against race conditions when overwriting values in shared_dict. (See Flask example below.)
There is no data persistence. If you restart the app, or if the main (the first) BaseManager process dies, the shared state is gone.
With this simple implementation of BaseManager, you cannot directly edit nested values in shared_dict. For example, shared_dict["array"][1] = 0 has no effect. You will have to edit a copy and then reassign it to the dictionary key.
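A short sketch of that copy-and-reassign workaround, using the keys from the example above:
arr = shared_dict["array"]    # a copy of the value is transferred into this process
arr[1] = 0                    # edit the local copy
shared_dict["array"] = arr    # reassign the key so the manager sees the change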
Flask example:
The following Flask app uses a global variable to store a counter number:
from flask import Flask

app = Flask(__name__)
number = 0


@app.route("/")
def counter():
    global number
    number += 1
    return str(number)
This works when using only one worker (gunicorn -w 1 server:app). When using multiple workers (gunicorn -w 4 server:app), it becomes apparent that number is not shared state but is individual to each worker process.
Instead, with shared_dict, the app looks like this:
from flask import Flask

app = Flask(__name__)

HOST = "127.0.0.1"
PORT = 35791
KEY = b"secret"
shared_dict, shared_lock = get_shared_state(HOST, PORT, KEY)
shared_dict["number"] = 0


@app.route("/")
def counter():
    with shared_lock:
        shared_dict["number"] += 1
        return str(shared_dict["number"])
This works with any number of workers, like gunicorn -w 4 server:app.
Your example is a bit magic for me! I'd suggest reusing the machinery already in the multiprocessing codebase in the form of a Namespace. I've attempted to make the following code compatible with spawn servers (i.e. MS Windows), but I only have access to Linux machines, so I can't test there.
Start by pulling in dependencies, defining our custom Manager, and registering a method to get at a Namespace singleton:
from multiprocessing.managers import BaseManager, Namespace, NamespaceProxy


class SharedState(BaseManager):
    _shared_state = Namespace(number=0)

    @classmethod
    def _get_shared_state(cls):
        return cls._shared_state


SharedState.register('state', SharedState._get_shared_state, NamespaceProxy)
This might need to be more complicated if creating the initial state is expensive and hence should only be done when it's needed (a lazy variant is sketched below). Note that the OP's version of initialising state during process startup will cause everything to reset if gunicorn starts a new worker process later, e.g. after killing one due to a timeout.
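For example, a minimal sketch of that lazy variant, assuming the expensive setup belongs inside _get_shared_state (the LazySharedState name is hypothetical):
class LazySharedState(BaseManager):
    _shared_state = None

    @classmethod
    def _get_shared_state(cls):
        # Build the Namespace on first access instead of at import time;
        # any expensive setup would go here.
        if cls._shared_state is None:
            cls._shared_state = Namespace(number=0)
        return cls._shared_state


LazySharedState.register('state', LazySharedState._get_shared_state, NamespaceProxy)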
Next, I define a function to get access to this shared state, similar to how the OP does it:
def shared_state(address, authkey):
    manager = SharedState(address, authkey)
    try:
        manager.get_server()  # raises if another server started
        manager.start()
    except OSError:
        manager.connect()
    return manager.state()
Though I'm not sure I'd recommend doing things like this. When gunicorn starts, it spawns lots of processes that all race to run this code, and it wouldn't surprise me if this could go wrong sometimes. Also, if gunicorn happens to kill off the server process (because of e.g. a timeout), every other process will start to fail.
That said, if we wanted to use this, we would do something like:
ss = shared_state('server.sock', b'noauth')
ss.number += 1
This uses Unix domain sockets (passing a string rather than a tuple as an address) to lock this down a bit more.
Also note this has the same race condition as the OP's code: incrementing a number causes the value to be transferred to the worker's process, incremented there, and sent back to the server. I'm not sure what the _lock is supposed to be protecting, but I don't think it'll do much.
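A rough illustration of that race (not the actual proxy internals, just what the augmented assignment amounts to):
value = ss.number    # the current value is copied into this worker process
value = value + 1    # incremented locally
ss.number = value    # written back; another worker may have written in between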

P4Python: use multiple threads that request perforce information at the same time

I've been working on a "crawler" of sorts that goes through our repository, listing directories and files as it goes. For every directory it encounters, it creates a thread that does the same for that directory, and so on, recursively. Effectively this creates a very short-lived thread for every directory encountered in the repository. (It doesn't take very long to request information on just one path; there are just tens of thousands of them.)
The logic looks as follows:
import threading
import perforce as Perforce  # custom Perforce class
from pathlib import Path

p4 = Perforce()
p4.connect()


class Dir():
    def __init__(self, path):
        self.dirs = []
        self.files = []
        self.path = path
        self.crawlers = []

    def build_crawler(self):
        worker = Crawler(self)
        # keep a reference so the thread object isn't garbage collected
        self.crawlers.append(worker)
        worker.start()


class Crawler(threading.Thread):
    def __init__(self, dir):
        threading.Thread.__init__(self)
        self.dir = dir

    def run(self):
        depotdirs = p4.getdepotdirs(self.dir.path)
        depotfiles = p4.getdepotfiles(self.dir.path)
        for p in depotdirs:
            if Path(p).is_dir():
                _d = Dir(p)
                self.dir.dirs.append(_d)
        for p in depotfiles:
            if Path(p).is_file():
                f = File(p)  # File is like Dir, but with less stuff, just a path.
                self.dir.files.append(f)
        for dir in self.dir.dirs:
            dir.build_crawler()
            # joining here keeps the Perforce calls from running concurrently
            for worker in dir.crawlers:
                worker.join()
Obviously this is not complete code, but it represents what I'm doing.
My question really is whether I can create an instance of this Perforce class in the __init__ method of the Crawler class, so that requests can be done separately. Right now, I have to call join() on the created threads so that they wait for completion, to avoid concurrent perforce calls.
I've tried it out, but it seems like there is a limit to how many connections you can create: I don't have a solid number, but somewhere along the line Perforce just started straight up refusing connections, which I presume is due to the number of concurrent requests.
Really what I'm asking, I suppose, is two-fold: is there a better way of creating a data model representing a repository with tens of thousands of files than the one I'm using, and is what I'm trying to do possible, and if so, how?
Any help would be greatly appreciated :)
I found out how to do this (it's infuriatingly simple, as with all simple solutions to overly complicated problems):
To build a data model that contains Dir and File classes representing a whole depot with thousands of files, just call p4.run("files", "-e", path + "\\...").
This returns a list of every file under path, recursively. From there, all you need to do is iterate over every returned path and construct your data model.
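A rough sketch of that idea, assuming run() returns the usual P4Python list of dicts with a "depotFile" key (adjust to whatever your wrapper actually returns):
files = p4.run("files", "-e", path + "\\...")  # one call for the whole depot path

root = Dir(path)
for record in files:
    depot_path = record["depotFile"]  # assumed key, as in standard p4 "files" output
    root.files.append(File(depot_path))
# Splitting depot_path on "/" would let you build intermediate Dir objects
# instead of a flat list, if the tree structure is needed.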
Hope this helps someone at some point.

Celery transfer command line arguments to Task

I am struggling with transferring additional command line arguments to a Celery task. I can set the desired attribute in a bootstep, however the same attribute is empty when accessed directly from the task (I guess it gets overridden).
class Arguments(bootsteps.Step):
    def __init__(self, worker, environment, **options):
        ArgumentTask.args = {'environment': environment}
        # this works
        print(ArgumentTask.args)
Here is the custom task
class ArgumentTask(Task):
    abstract = True
    _args = {}

    @property
    def args(self):
        return self._args

    @args.setter
    def args(self, value):
        self._args.update(value)
And actual task
@celery.task(base=ArgumentTask, bind=True, name='jobs.send')
def send(self):
    # this prints an empty dictionary
    print(self.args)
Do I need to use some additional persistence layer, e.g. persistent objects, or am I missing something really obvious?
Similar question
It does not seem to be possible. The reason is that your task could be consumed anywhere, by any consumer of the queue, and each consumer may have different command line parameters, so its processing should not depend on a worker's configuration.
If your problem is managing dev/prod environments, this is the way we handled it in our project:
Each environment is jailed in its own venv, with a configuration so that the project is aware of its environment (in our case it's just the db links in the configuration that change). Each environment has its own queues and Celery workers, launched with this command:
/path/venv/bin/celery worker -A async.myapp --workdir /path -E -n celery-name#server -Ofair
Hope it helped.
If you really want to dig into that, each task can access .control, which allows launching control operations on Celery (like some monitoring). But I didn't find anything helpful there.

python Redis Connections

I am using a Redis server with Python.
My application is multithreaded (I use 20-32 threads per process) and I also run the app on different machines.
I have noticed that sometimes Redis CPU usage is 100% and the Redis server becomes unresponsive/slow.
I would like to use, per application, one connection pool of 4 connections in total.
So for example, if I run my app on 20 machines at maximum, there should be 20*4 = 80 connections to the Redis server.
import redis
from threading import Thread

POOL = redis.ConnectionPool(max_connections=4, host='192.168.1.1', db=1, port=6379)
R_SERVER = redis.Redis(connection_pool=POOL)


class Worker(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.start()

    def run(self):
        while True:
            key = R_SERVER.randomkey()
            if not key:
                break
            value = R_SERVER.get(key)
            self._do_something(value)

    def _do_something(self, value):
        # do something with value
        pass


if __name__ == '__main__':
    num_threads = 20
    workers = [Worker() for _ in range(num_threads)]
    for w in workers:
        w.join()
The above code should run 20 threads that get a connection from the connection pool of max size 4 whenever a command is executed.
When is the connection released?
According to this code (https://github.com/andymccurdy/redis-py/blob/master/redis/client.py):
#### COMMAND EXECUTION AND PROTOCOL PARSING ####
def execute_command(self, *args, **options):
    "Execute a command and return a parsed response"
    pool = self.connection_pool
    command_name = args[0]
    connection = pool.get_connection(command_name, **options)
    try:
        connection.send_command(*args)
        return self.parse_response(connection, command_name, **options)
    except ConnectionError:
        connection.disconnect()
        connection.send_command(*args)
        return self.parse_response(connection, command_name, **options)
    finally:
        pool.release(connection)
After the execution of each command, the connection is released and goes back to the pool.
Can someone verify that I have understood the idea correctly and that the above example code will work as described?
Because when I look at the Redis connections, there are always more than 4.
EDIT: I just noticed in the code that the function has a return statement before the finally. What is the purpose of finally then?
As Matthew Scragg mentioned, the finally clause is executed at the end of the try block, regardless of how the block exits. In this particular case it serves to release the connection back to the pool when finished with it instead of leaving it hanging open.
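A minimal illustration of that behaviour (plain Python semantics, nothing specific to redis-py):
def demo():
    try:
        return "from try"            # the return value is computed first
    finally:
        print("finally still runs")  # then the finally block executes

print(demo())  # prints "finally still runs", then "from try"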
As to the unresponsiveness, look at what your server is doing. What is the memory limit of your Redis instance? How often are you saving to disk? Are you running on a Xen-based VM such as an AWS instance? Are you running replication, and if so, how many slaves, and are they in a good state or are they frequently calling for a full resync of data? Are any of your commands "save"?
You can answer some of these questions by using the command line interface. For example, redis-cli info persistence will give you information about the process of saving to disk, and redis-cli info memory will tell you about your memory consumption.
When obtaining the persistence information, you specifically want to look at rdb_last_bgsave_status and rdb_last_bgsave_time_sec. These will tell you whether the last save was successful and how long it took. The longer it takes, the higher the chance that you are running into resource issues and will encounter slowdowns, which can appear as unresponsiveness.
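For instance, a small sketch using redis-py (which the question already uses) to read those two fields; the host, port, and db values are placeholders:
import redis

r = redis.Redis(host='192.168.1.1', port=6379, db=1)
info = r.info('persistence')  # same data as `redis-cli info persistence`
print(info['rdb_last_bgsave_status'], info['rdb_last_bgsave_time_sec'])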
The finally block will always run even though there is a return statement before it. You may have a look at redis-py/connection.py: pool.release(connection) only puts the connection back into the pool of available connections, so the connection is still alive.
About Redis server CPU usage: your app always sends requests and has no breaks or sleeps, so it just uses more and more CPU, but not memory. And CPU usage has no relation to the number of open files.
