I have a Flask server which consists of multiple methods. I am aiming to automate the execution of these methods by using Airflow.
I am thinking of using the following steps:
Setting up Airflow by defining multiple DAGS to call the relevant flask methods in a pipeline.
Deploying Flask Server.
Deploying Airflow (using docker-compose).
Mainly, I am thinking of keeping the Airflow and Flask servers separate and independent of each other. Do you think this is a good plan? Any other suggestions would be highly appreciated.
It depends on a couple of things.
Can you run the methods from inside Airflow? For security reasons it is often required to keep some functionality in a different environment/cluster. One reason could be database access that the methods need but that you do not want to give to the Airflow environment.
Are these methods also invoked from other locations, or are they solely for Airflow?
What other functionality does the flask server have that you can't live without?
Are there Python dependency conflicts? Even in that case you could use the PythonVirtualenvOperator of Airflow.
If none of the answers here completely blocks you from invoking these methods from inside Airflow, I would vote to run them completely inside Airflow. This will reduce coupling and also reduce the maintenance burden for you in the long term. Besides, Airflow will save you from worrying about a lot of things, like connectivity, exception codes and callbacks for when something goes wrong.
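A minimal sketch of that recommendation, assuming the Flask methods can be reduced to plain Python functions. The names `extract`, `transform`, `load` and `run_pipeline` are hypothetical and do not come from the question. Once the logic lives in plain functions, each one could be passed as the `python_callable` of an Airflow `PythonOperator`, and a Flask route could still wrap the same function if an HTTP entry point is needed elsewhere.

```python
from typing import Dict, List


# Hypothetical business logic that used to live inside Flask view functions.
def extract() -> List[int]:
    return [1, 2, 3]


def transform(rows: List[int]) -> List[int]:
    return [r * 10 for r in rows]


def load(rows: List[int]) -> Dict[str, int]:
    return {"loaded": len(rows), "total": sum(rows)}


def run_pipeline() -> Dict[str, int]:
    """Run the steps in the order a DAG would schedule them:
    extract, then transform, then load."""
    return load(transform(extract()))


if __name__ == "__main__":
    print(run_pipeline())
```

Because the functions have no dependency on Flask or Airflow, they can be unit tested on their own and imported from either side.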
I am working on a project that allows users to upload a Python script to an API and run it on a schedule. Currently, I'm trying to figure out a way to limit the functionality of the script so that it cannot access local files, mess with the Flask server running the API, etc. Do you have any ideas on how I can achieve this? Is there any way to make it so only specific libraries are available for importing?
Running other people's scripts on your server is a serious security issue. If you are trying to deploy a Python interpreter in your web application, you can try something like judge0 (available on GitHub). It is free if you deploy it yourself and it will run scripts safely inside containers.
The simplest way is to ensure the user running the script is not root, but a user specifically designed for this task (e.g. part of a group that can only read and not write or execute). This means at minimum you should ensure all files have the appropriate mode. Then you can just use a pipe or something to run the script.
Alternatively, you could use a runtime that’s not “local”, like a VM or a compute service (AWS Lambda, etc). The latter would be simplest, and there are lots of vendors who offer compute services with a programmatic API.
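The first option above (a dedicated non-root user) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `run_untrusted` and its parameters are hypothetical, the code targets POSIX systems, and it is not a sandbox. It only limits run time, strips the environment, starts the script in an empty working directory, and optionally switches the OS user.

```python
import os
import subprocess
import sys
import tempfile
from typing import Optional


def run_untrusted(script_path: str, timeout_s: float = 5.0,
                  user: Optional[str] = None) -> subprocess.CompletedProcess:
    """Run a script in a child interpreter with a few basic restrictions.

    NOT a sandbox: file permissions must still be set up so that the
    chosen user cannot read or write anything sensitive.
    Raises subprocess.TimeoutExpired if the script runs too long.
    """
    extra = {}
    if user is not None:
        # Assumption: Python 3.9+, POSIX, and a parent process that is
        # privileged enough to switch to this user.
        extra["user"] = user
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            # -I: isolated mode, ignores PYTHON* variables and user site-packages
            [sys.executable, "-I", os.path.abspath(script_path)],
            cwd=workdir,                    # do not start in the server's directory
            env={"PATH": "/usr/bin:/bin"},  # do not leak the server's environment
            capture_output=True,
            text=True,
            timeout=timeout_s,
            **extra,
        )
```

A call such as `run_untrusted("/srv/uploads/job.py", user="sandbox")` would use a hypothetical path and user name. Restricting which libraries can be imported is not covered here; a container or separate compute service is the more robust route for that.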
I am trying to wrap my head around how DAGs should actually be executed in a Docker Compose environment when they depend on another service (with a separate Python venv) defined in the compose file.
I have set up Airflow via Docker Compose as described in the official documentation. I have also added a Django service, which has its own dependencies.
Now, I would like to have a DAG that executes a Python script using that Django service's Python environment (it also uses Django's models; not sure if that's relevant).
The only way I see it working is with DockerOperator as described here. I managed to set up and execute the test DAG mentioned there; however, when I try to run the real task, it fails due to networking issues. I am quite confident I can solve that issue, but setting everything up this way just seems like way too much hassle.
So, in the end, I guess I am wondering what the ideal architecture should be when using Airflow via Compose. Should the base Airflow image be extended with my Django service (creating one hell of a big image), or is there a better way?
You can use PythonVirtualenvOperator (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator), but it will recreate the Django virtualenv every time the task is run, so it is not ideal.
Another option is to run DockerOperator or KubernetesPodOperator (if you use Kubernetes) and have a separate image with Django installed (or even a base Django image).
Adding Django to Airflow is probably not the best idea. Airflow has ~500 dependencies when installed with all providers, so chances are that you will have some difficult-to-resolve conflicts.
Also, one of the things we are considering for Airflow 2.2 and beyond is a better way of handling caching, which could help with building a cacheable virtualenv that is created once and shared between workers/pods (but this is just in the discussion phase).
You can check out tomorrow's session on Airflow Summit where we discuss what's coming (and generally Airflow Summit is cool):
https://airflowsummit.org/sessions/2021/looking-ahead-what-comes-after-airflow-2/
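The "created once and reused" virtualenv idea mentioned above can be illustrated outside Airflow with the standard library alone. This is not Airflow's mechanism and not how PythonVirtualenvOperator works; the helper names `ensure_venv` and `run_in_venv` are hypothetical, and the sketch assumes a POSIX layout for the environment directory.

```python
import subprocess
import venv
from pathlib import Path


def ensure_venv(env_dir: Path) -> Path:
    """Create the virtualenv only if it is missing; return its interpreter."""
    python = env_dir / "bin" / "python"  # POSIX layout (Windows uses Scripts\python.exe)
    if not python.exists():
        # with_pip=False keeps the sketch fast; a real environment would need
        # pip and the service's requirements installed at this point.
        venv.create(env_dir, with_pip=False, symlinks=True)
    return python


def run_in_venv(env_dir: Path, code: str) -> str:
    """Run a snippet with the cached environment's interpreter."""
    python = ensure_venv(env_dir)
    result = subprocess.run(
        [str(python), "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The first call pays the creation cost; later calls with the same `env_dir` reuse the existing environment instead of rebuilding it.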
I am dockerizing a Python webapp using the https://hub.docker.com/r/tiangolo/uwsgi-nginx image, which uses supervisor to control the uWSGI instance.
My app actually requires an additional supervisor-mediated process to run (LibreOffice headless, with which I generate documents through the appy module), and I'm wondering what is the proper pattern to implement it.
The way I see it, I could extend the above image with the extra supervisor config for my needs (along with all the necessary OS-level install steps), but this would be in contradiction with the general principle of running the least amount of distinct processes in a given container. However, since my Python app is designed to talk with LibreOffice only locally, I'm not sure how I could achieve it with a more containerized approach. Thanks for any help or suggestion.
The recommendation for one-process-per-container is sound - Docker only monitors the process it starts when the container runs, so if you have multiple processes they're not watched by Docker. It's also a better design - you have lightweight, focused containers with single responsibilities, and you can manage them independently.
user2105103 is right though, the image you're using already loses that benefit because it runs Python and Nginx, and you could extend it with LibreOffice headless and package your whole app without changing code.
If you move to a more "best practice" approach, you'd have a distributed app running across three containers in a Docker network:
nginx - web proxy, this is the public entry point to the app. Nginx can do routing, caching, SSL termination, rate limiting etc.
app - your Python app, only visible inside the Docker network. Receives requests from nginx and uses libreoffice for document manipulation.
libreoffice - running in headless mode with the API exposed, but only available within the Docker network.
You'd need code changes for this, bringing in something like PyOO to use the LibreOffice API remotely from the app container.
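In the three-container layout, the app reaches LibreOffice by host name over the Docker network rather than locally. As a small sketch of what that implies, the helper below waits until a TCP port on another container accepts connections before the app starts using it. The function name `wait_for_service`, the host name `libreoffice` and the port `2002` in the usage note are assumptions for illustration, not values from the answer.

```python
import socket
import time


def wait_for_service(host: str, port: int,
                     timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Not reachable yet (container still starting, name not resolvable, ...)
            time.sleep(interval_s)
    return False
```

The app container could call `wait_for_service("libreoffice", 2002)` at startup, assuming the LibreOffice container is named `libreoffice` in the Docker network and was started to listen on that port.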
You've already blown the "one process per container" -- just add another process. It's not a hard rule, or even one that everybody agrees with.
Extend away, or better yet author your own custom container. That way you own it, you understand it, and it's optimized for your purpose.
I have a backend server running on Heroku. Right now, for going through logs, all I have been using is the 'heroku logs' command. I have also been using that command to track how long different requests to each endpoint are taking.
Is there a better way to see a list of how long requests to different endpoints are taking, and a good way to track the bottlenecks that are slowing down these endpoints? Also, are there any good add-ons for Heroku that can point out bad responses whose status is not 200?
I am using Python with Django if that is relevant.
The best tool I found is New Relic (newrelic.com). It hooks nicely into Django apps and Heroku. It can even show you the bottlenecks due to queries and functions inside your views.
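As a lightweight complement to a monitoring service (not a replacement for it), per-request timing and non-200 responses can also be logged from the application itself. The sketch below follows Django's middleware protocol (a callable built with `get_response`); the class name `RequestTimingMiddleware` and the logger name `request_timing` are hypothetical.

```python
import logging
import time

logger = logging.getLogger("request_timing")


class RequestTimingMiddleware:
    """Django-style middleware that logs method, path, status and duration."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.perf_counter()
        response = self.get_response(request)
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Responses other than 200 are logged at WARNING so they are easy to filter.
        level = logging.INFO if response.status_code == 200 else logging.WARNING
        logger.log(level, "%s %s -> %s in %.1f ms",
                   request.method, request.path, response.status_code, elapsed_ms)
        return response
```

To use it, the class would be added to the `MIDDLEWARE` setting by its dotted path, and the logger would need a handler that writes to the console so the lines show up in the 'heroku logs' output.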
I am working on a web application that uses a permanent object MyService. Using a web interface, I dynamically update its state and monitor its behavior. Now I would like to periodically call one of its methods. I was thinking of using a Celery PeriodicTask but ran into some scope issues. It seems I need to run three different processes:
python manage.py runserver
python manage.py celery worker
python manage.py celerybeat
The problem is that even if I ensure that MyService is a singleton that can be safely used by more than one thread, Celery creates its own fresh copy of the object. Is there a way I could share this object between both the Django server and the Celery main process? I tried to find a way to start Celery from within a Django script, but so far with no success. Would appreciate any help.
If you need to share something between multiple processes or maybe even multiple machines (e.g. your workers could run on a separate machine), the best (and probably easiest) practice is to share the information through an external service.
In the simplest case you could use Django's DB, but if you find that it is not suitable, for example because you have a heavy write load, you can use something like Redis or Memcached (which you can also talk to via Django's caching API). These can handle a big write load, and besides, you can use e.g. Redis as a queue for Celery as well.
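A minimal sketch of the idea, using a SQLite file as a stand-in for the external store (Django's DB, Redis or Memcached in the answer above). The class name `SharedState` and its methods are hypothetical. The point is that each process keeps no authoritative copy of the state and instead reads and writes it through the shared store.

```python
import json
import sqlite3
from typing import Any


class SharedState:
    """Tiny key-value store backed by a database file shared between processes."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS state "
                "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
            )

    def set(self, key: str, value: Any) -> None:
        # The connection context manager commits on success.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
                (key, json.dumps(value)),
            )

    def get(self, key: str, default: Any = None) -> Any:
        row = self.conn.execute(
            "SELECT value FROM state WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])
```

The web process could call `set` when the user changes the service's state, and the periodic task could call `get` before running the method, so both sides see the same data without sharing a Python object.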