How to use the DockerOperator from Apache Airflow

How to use the DockerOperator from Apache Airflow - python

This question is related to understanding a concept regarding the DockerOperator and Apache Airflow, so I am not sure if this site is the correct place. If not, please let me know where I can post it.
The situation is the following: I am working with a Windows laptop, I have a developed very basic ETL pipeline that extracts data from some server and writes the unprocessed data into a MongoDB on a scheduled basis with Apache-Airflow. I have a docker-compose.yml file with three services: A mongo service for the MongoDB, a mongo-express service as admin tool for the MongoDB, a webserver service for Apache-Airflow and a postgres service as database backend for Apache-Airflow.
So far, I have developed some Python code in functions and these functions are being called by the Airflow instance using the PythonOperator. Since debugging is very difficult using the PythonOperator, I want to try the DockerOperator now instead. I have been following this tutorial which claims that using the DockerOperator, you can develop your source code independent of the operating system the code will later be executed on due to Docker's concept 'build once, run everywhere'.
My problem is that I didn't fully understand all necessary steps needed to run code using the DockerOperator. In the tutorial, I have the following questions regarding the Task Development and Deployment:
Package the artifacts together with all dependencies into a Docker image. ==> Does this mean that I have to create a Dockerfile for every task and then build an image using this Dockerfile?
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
Thanks for your time, I highly appreciate it!

Typically you're going to have a Docker image that handles one type of task. So for any one pipeline you'd probably be using a variety of different Docker images, one different one for each step.
There are a couple of considerations here in regards to your question which is specifically around deployment.
You'll need to create a Docker image. You likely want to add a tag to this as you will want to version the image. The DockerOperator defaults to the latest tag on an image.
The image needs to be available to your deployed instance of Airflow. They can be built on the machine you're running Airflow on if you're wanting to run it locally. If you've deployed Airflow somewhere online, the more common practice would be to push them to a cloud service. There are a number of providers you can use (Docker Hub, Amazon ECR, etc...).
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
If you have your image built, and is available to Airflow you simply need to create a task using the DockerOperator like so:
dag = DAG(**kwargs)
task_1 = DockerOperator(
dag=dag,
task_id='docker_task',
image='dummyorg/dummy_api_tools:v1',
auto_remove=True,
docker_url='unix://var/run/docker.sock',
command='python extract_from_api_or_something.py'
)
I'd recommend investing some time into understanding Docker. It's a little bit difficult to wrap your head around at first but it's a highly valuable tool, especially for systems like Airflow.

Related

Building containers within google dataflow pipeline

tl;dr
Apache Beam pipeline step involes building docker image; How to run this pipeline using Google Dataflow? What alternatives exist?
I'm currently trying make my first steps with google's dataflow service and apache beam (python).
Trivial examples are pretty straight forward but things get confusing to me as soon as external software dependencies come into play. It seems to be possible to use custom docker containers to setup ones own environment [1][2]. While that's great for most dependencies, it doesn't help, if the dependency is docker itself, as it happens to be the case for me:
One step of my pipeline involves using an external project which makes heavy use of docker (i. e. building images, running them)
As far as I can tell there are three options to tackle that problem:
Docker within Docker
Run the external project's scripts which build docker images within a docker container running on a dataflow worker node. While building docker image within docker is possible in principle [3] I've got the feeling that won't work in this case, since there is only very limited control over the environment.
Custom VM image for worker nodes
Is it possible to use custom vm images for dataflow worker nodes?
Don't use Google Dataflow
What are better suited alternative services?
Thanks!
[1] Custom VM images for Google Cloud Dataflow workers
[2] https://cloud.google.com/dataflow/docs/guides/using-custom-containers
[3] https://www.docker.com/blog/docker-can-now-run-within-docker/
Edit: Added line breaks.

Custom VM image for worker nodes Is it possible to use custom vm images for dataflow worker nodes?
It's not possible to completely replace the Dataflow worker. But you can use a custom Beam SDK Docker container as you noted. This will result in a Docker in Docker type execution for your case.
Don't use Google Dataflow What are better suited alternative services?
Please see here for other Beam runners and their capabilities.

How DAGs should be run when airflow deployed with docker compose

I am trying to wrap my head around how DAGs should actually be executed in docker compose environment, when they are dependent on other service (separate python venv) defined in the compose file.
I have setup airflow via docker compose as mentioned in official documentation. Also, I have added a Django service, which has its own dependencies.
Now, i would like to have a DAG, that executes python script using that Django's service python environment (It also uses Django's models. Not sure if that's relevant).
The only way I see it working is with DockerOperator as described here. I managed to setup and execute the test DAG mentioned there, however when I try to run the real task, it fails due to networking issues. Iam quite confident i can solve that issue, but setting everything this way just seems like way too much hassle.
So, in the end I guess Iam wondering what the ideal architecture should when using Airflow via compose? Should the base airflow image be extended with my Django service (creating one hell of a big image) or is there a better way?

You can use PythonVirtualEnvOperator (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator) - but it will recreate django virtualenv every time task is run so not ideal
Another option will be to run DockerOperator or KubernetesPodOperator (if you use Kubernetes) and have separate image with Django installed (or even base Django image).
Adding Django to Airflow is probably not the best idea - Airflow has ~500 dependencies when installed with all providers, so chance is that you will have some difficults-to-resolve conflicts.
Also one of the things we consider for Airlfow 2.2 and beyond is to make better way of handling caching, which could help with building cacheable virtualenv created once and shared between workers/pods (but this is just in discussion phase)
You can check out tomorrow's session on Airflow Summit where we discuss what's coming (and generally Airflow Summit is cool):
https://airflowsummit.org/sessions/2021/looking-ahead-what-comes-after-airflow-2/

What's the best practice to run the same github python code on several ec2 instances?

I'm using venv for my python repo on github, and wanted to run the same code on 10+ ec2 instances (each instance will have a cronjob that just runs the same code on the same schedule)
Any recommendations on how to best achieve this + continue to make sure all instances get the latest release branches on github? I'd like to try and automate any configuration I need to do, so that I'm not doing this:
Create one ec2 instance, set up all the configurations I need, like download latest python version, etc. Then git clone, set up all the python packages I need using venv. Verify code works on this instance.
Repeat for remaining 10+ ec2 instances
Whenever someone releases a new master branch, I have to ssh into every ec2 instances, git pull to the correct branch, re-update any new configurations I need, repeat for all remaining 10+ ec2 instances.
Ideally I can just run some script that pushes everything that's needed to make the code work on all ec2 instances. I have little experience with this type of thing, but from reading around this is an approach I'm considering. Am I on the right track?:
Create a script I run to ssh into all my ec2 instances and git clone/update to correct branch
Use Docker to make sure all ec2 instances are set up properly so the python code works (Is this the right use-case for Docker?). Above script will run the necessary Docker commands
Similar thing with using venv and reading the requirements.txt file so all ec2 instances has the right python packages and versions

Depending on your app and requirements (is EC2 100% necessary?) I can recommend following:
Capistrano-like SSH deployments (https://github.com/dlapiduz/fabistrano) if your fleet is static and you need fast deployments. Not a best practice and not terribly secure, but you mentioned similar scheme in your post
Using AWS Image Builder (https://aws.amazon.com/image-builder/) or Packer (https://www.packer.io/) to build new release image and then replace old image with new in your EC2 autoscaling group
Build docker image of your app and use ECS or EKS to host it. I would recommend this approach if you're not married to running code directly on EC2 hosts.

Is there a way to schedule docker containers to run with respect to one another?

I'm creating microservices using Docker Containers.
Initially, I run one Docker container and it provides me with some output that I need as input for a second docker container.
So the flow of steps would be:
Run Docker container;
Get output;
Trigger running of second Docker container with previous output.
I have looked into Kubernetes, cloud functions and pub/sub on Google Cloud. Though I would like to run this locally first.
Unfortunately, I haven't found a solution, my processes are more like scripts than web-based applications.

If you are asking from a auto-devops point of view, you can try utilizing kubectl wait command for this.
Prepare yaml files for each of the containers (eg pod, deployment or statefulset)
Write a shell script, running kubectl apply -f on the files.
Use kubectl wait and jsonpath to pass info from prvious pods to the next, once it's in the ready / RUNNING state.
Find the detailed docs here.

You can have a look at Kubernetes Jobs. It might do the trick, although Kubernetes was not really made for running scripts in sequence and sharing dependencies between containers.
This thread is similar to what you need. One tool mentioned that intrigued me was brigade.

Run multiple Python scripts in Azure (using Docker?)

I have a Python script that consumes an Azure queue, and I would like to scale this easily inside Azure infrastructure. I'm looking for the easiest solution possible to
run the Python script in an environment that is as managed as possible
have a centralized way to see the scripts running and their output, and easily scale the amount of scripts running through a GUI or something very easy to use
I'm looking at Docker at the moment, but this seems very complicated for the extremely simple task I'm trying to achieve. What possible approaches are known to do this? An added bonus would be if I could scale wrt the amount of items on the queue, but it is fine if we'd just be able to manually control the amount of parallelism.

You should have a look at Azure Web Apps, which also support Python.
This would be a managed and scaleable environment and also supports background tasks (WebJobs) with a central logging.
Azure Web Apps also offer a free plan for development and testing.

Per my experience, I think CoreOS on Azure can satisfy your needs. You can try to refer to the doc https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-coreos-how-to/ to know how to get started.
CoreOS is a Linux distribution for running Docker as Linux container, that you can remote access via SSH client like putty. For using Docker, you can search the key words Docker tutorial via Bing to rapidly learning some simple usage that enough for running Python scripts.

Sounds to me like you are describing something like a micro-services architecture. From that perspective, Docker is a great choice. I recommend you consider using an orchestration framework such as Apache Mesos or Docker Swarm which will allow you to run your containers on a cluster of VMs with the ability to easily scale, deploy new versions, rollback and implement load balancing. The schedulers Mesos supports (Marathon and Chronos) also have a Web UI. I believe you can also implement some kind of triggered scaling like you describe but that will probably not be off the shelf.
This does seem like a bit of a learning curve but I think is worth it especially once you start considering the complexities of deploying new versions (with possible rollbacks), monitoring failures and even integrating things like Jenkins and continuous delivery.
For Azure, an easy way to deploy and configure a Mesos or Swarm cluster is by using Azure Container Service (ACS) which does all the hard work of configuring the cluster for you. Find additional info here: https://azure.microsoft.com/en-us/documentation/articles/container-service-intro/

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.