Building containers within google dataflow pipeline

tl;dr
An Apache Beam pipeline step involves building a Docker image. How can this pipeline be run on Google Dataflow? What alternatives exist?
I'm currently trying to make my first steps with Google's Dataflow service and Apache Beam (Python).
Trivial examples are pretty straightforward, but things get confusing to me as soon as external software dependencies come into play. It seems to be possible to use custom Docker containers to set up one's own environment [1][2]. While that's great for most dependencies, it doesn't help if the dependency is Docker itself, as happens to be the case for me:
One step of my pipeline involves using an external project which makes heavy use of Docker (i.e. building images and running them).
As far as I can tell there are three options to tackle that problem:
Docker within Docker
Run the external project's scripts, which build Docker images, within a Docker container running on a Dataflow worker node. While building a Docker image within Docker is possible in principle [3], I've got the feeling that it won't work in this case, since there is only very limited control over the environment.
Custom VM image for worker nodes
Is it possible to use custom VM images for Dataflow worker nodes?
Don't use Google Dataflow
What are better suited alternative services?
Thanks!
[1] Custom VM images for Google Cloud Dataflow workers
[2] https://cloud.google.com/dataflow/docs/guides/using-custom-containers
[3] https://www.docker.com/blog/docker-can-now-run-within-docker/

Custom VM image for worker nodes: "Is it possible to use custom VM images for Dataflow worker nodes?"
It's not possible to completely replace the Dataflow worker, but you can use a custom Beam SDK Docker container, as you noted. In your case this results in a Docker-in-Docker style of execution.
Don't use Google Dataflow: "What are better suited alternative services?"
Please see here for other Beam runners and their capabilities.

Related

How to use AWS-Fargate as a python script

I want to use AWS Fargate as a data preprocessor in an ML pipeline.
I pushed a Docker image containing my Python script to AWS ECR.
I also created a task with this image.
My questions are:
Which cluster should I use? I don't understand the concept of a cluster well.
How do I deploy it in the pipeline? (Ideally it would be triggered by an S3 event and executed like docker run.)
Thank you for your answers.

The cluster to use is Networking only:
With this option, you can launch a cluster with a new VPC to use for Fargate tasks.
The CI/CD pipeline for automated deployment of your Docker images to your ECS service could be built using CodePipeline. A good start is the following tutorial:
Tutorial: Continuous Deployment with CodePipeline for ECS
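One way the S3-triggered run from the question could be approached is a small function, for example inside an AWS Lambda handler, that turns the S3 event into a request for the ECS RunTask API. The sketch below only builds that request. The cluster, task definition, container name, subnet IDs and the INPUT_BUCKET / INPUT_KEY variable names are all hypothetical, and handing the result to boto3's ECS client via run_task(**request) is an assumption about how you would wire it up, not something the answer above prescribes.

```python
from urllib.parse import unquote_plus


def build_run_task_request(event, cluster, task_definition, container_name, subnets):
    """Translate an S3 event notification into keyword arguments for ECS RunTask."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(s3_info["object"]["key"])
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "ENABLED"}
        },
        # Pass the uploaded object to the container as environment variables
        # (hypothetical names that the Python script inside the image would read).
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    "environment": [
                        {"name": "INPUT_BUCKET", "value": bucket},
                        {"name": "INPUT_KEY", "value": key},
                    ],
                }
            ]
        },
    }
```

Inside a Lambda handler this might be used as request = build_run_task_request(event, "my-cluster", "preprocess-task", "preprocessor", ["subnet-..."]) followed by a call to the ECS client; the script in the container would then read the two environment variables to find its input.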

Parallelize a simple python script with kubernetes

I have a python script that looks like this:
def process_number(num):
    # Do some processing using this number
    print(num)
I want to spin up a kubernetes cluster and pass in a range of numbers, and have this script run in parallel across many machines. The number range can be hardcoded. I am unsure how to set up the dockerfile for this application, and how to deploy it to kubernetes, as most of the examples I can find are for web apps.

You need a base image that has Python. You'll also need to define more of the script so that this function keeps running; otherwise the container will start and exit very quickly, which leads to CrashLoopBackOff.
Before worrying about how to deploy this in kubernetes, you need to make sure you have a working script and that you can containerize it.
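As a rough idea of what a working script could look like before it is containerized, the sketch below splits a hardcoded range across workers. It assumes each container is told which worker it is through two environment variables, WORKER_INDEX and WORKER_COUNT; those names are hypothetical, and how they get set depends on how you deploy.

```python
import os


def process_number(num):
    # Do some processing using this number
    print(num)


def shard(numbers, index, total):
    """Return the items this worker is responsible for (every total-th item)."""
    return [n for i, n in enumerate(numbers) if i % total == index]


def main():
    # Hypothetical variables; default to a single worker that handles everything.
    index = int(os.environ.get("WORKER_INDEX", "0"))
    total = int(os.environ.get("WORKER_COUNT", "1"))
    for num in shard(range(0, 100), index, total):
        process_number(num)


if __name__ == "__main__":
    main()
```

Running it locally with WORKER_INDEX=1 and WORKER_COUNT=4 processes 1, 5, 9 and so on. Note that this version exits once its share is done, so it fits a run-to-completion workload rather than a long-running one.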

I recommend you containerize your application first and ensure it is running well before moving to Kubernetes.
You can see some examples of how to containerize it in these links:
Dockerize your Python Application
Containerize a Python App in 5 Minutes
After ensuring your application is running as expected in a container, you can move to the next step, which is deploying it in Kubernetes.
I strongly recommend you start with some Kubernetes tutorials to understand how the components work and how you can deploy your application. You can use these links:
Local Development Environment for Kubernetes using Minikube
Getting Started With Kubernetes #1
Getting Started with Kubernetes #2
Kubernetes: An Introduction
kubernetes the hard way
k8s-intro-tutorials
If you don't have much time, or you just need something that works for your case, take a look at Kubernetes Deployment; you can run your containerized application across multiple replicas.
To expose your service (internally and externally) you could use Kubernetes Service.
To test locally, you can use a minikube installation on your machine.
Hope that helps!

How to use the DockerOperator from Apache Airflow

This question is related to understanding a concept regarding the DockerOperator and Apache Airflow, so I am not sure if this site is the correct place. If not, please let me know where I can post it.
The situation is the following: I am working with a Windows laptop, and I have developed a very basic ETL pipeline that extracts data from some server and writes the unprocessed data into a MongoDB on a scheduled basis with Apache Airflow. I have a docker-compose.yml file with four services: a mongo service for the MongoDB, a mongo-express service as admin tool for the MongoDB, a webserver service for Apache Airflow and a postgres service as database backend for Apache Airflow.
So far, I have developed some Python code in functions and these functions are being called by the Airflow instance using the PythonOperator. Since debugging is very difficult using the PythonOperator, I want to try the DockerOperator now instead. I have been following this tutorial which claims that using the DockerOperator, you can develop your source code independent of the operating system the code will later be executed on due to Docker's concept 'build once, run everywhere'.
My problem is that I didn't fully understand all necessary steps needed to run code using the DockerOperator. In the tutorial, I have the following questions regarding the Task Development and Deployment:
Package the artifacts together with all dependencies into a Docker image. ==> Does this mean that I have to create a Dockerfile for every task and then build an image using this Dockerfile?
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
Thanks for your time, I highly appreciate it!

Typically you're going to have a Docker image that handles one type of task. So for any one pipeline you'd probably be using a variety of Docker images, a different one for each step.
There are a couple of considerations here with regard to your question, which is specifically about deployment.
You'll need to create a Docker image. You likely want to add a tag to this as you will want to version the image. The DockerOperator defaults to the latest tag on an image.
The image needs to be available to your deployed instance of Airflow. They can be built on the machine you're running Airflow on if you're wanting to run it locally. If you've deployed Airflow somewhere online, the more common practice would be to push them to a cloud service. There are a number of providers you can use (Docker Hub, Amazon ECR, etc...).
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
Once your image is built and available to Airflow, you simply need to create a task using the DockerOperator like so:
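from airflow import DAG
# Import path for Airflow 2.x with the Docker provider installed;
# older releases used airflow.operators.docker_operator instead.
from airflow.providers.docker.operators.docker import DockerOperator

dag = DAG(**kwargs)  # kwargs: your own DAG arguments

task_1 = DockerOperator(
    dag=dag,
    task_id='docker_task',
    image='dummyorg/dummy_api_tools:v1',  # versioned tag instead of the default latest
    auto_remove=True,  # remove the container once the task finishes
    docker_url='unix://var/run/docker.sock',  # Docker daemon socket on the Airflow host
    command='python extract_from_api_or_something.py',  # executed inside the container
)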
I'd recommend investing some time into understanding Docker. It's a little bit difficult to wrap your head around at first but it's a highly valuable tool, especially for systems like Airflow.
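To make the "entrypoint" part more concrete: the script named in command is just a normal Python program inside the image, and parameterizing the task means passing it command-line arguments. Below is a minimal sketch of what such a script could look like; the --source and --limit arguments and the body of main are hypothetical, not taken from the tutorial.

```python
import argparse


def parse_args(argv=None):
    """Parse the parameters that the DockerOperator passes via its command."""
    parser = argparse.ArgumentParser(description="Extract data from an API.")
    parser.add_argument("--source", default="default-source",
                        help="name of the source to extract from (hypothetical)")
    parser.add_argument("--limit", type=int, default=100,
                        help="maximum number of records to fetch (hypothetical)")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder for the real extraction logic.
    print(f"extracting up to {args.limit} records from {args.source}")
    return 0


if __name__ == "__main__":
    main()
```

With a script like this in the image, the task could be parameterized by changing the command string, for example 'python extract_from_api_or_something.py --source orders --limit 50'.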

Is there a way to schedule docker containers to run with respect to one another?

I'm creating microservices using Docker Containers.
Initially, I run one Docker container and it provides me with some output that I need as input for a second docker container.
So the flow of steps would be:
Run Docker container;
Get output;
Trigger running of second Docker container with previous output.
I have looked into Kubernetes, cloud functions and pub/sub on Google Cloud. Though I would like to run this locally first.
Unfortunately, I haven't found a solution, my processes are more like scripts than web-based applications.

If you are asking from an auto-devops point of view, you can try using the kubectl wait command for this.
Prepare YAML files for each of the containers (e.g. pod, deployment or statefulset).
Write a shell script, running kubectl apply -f on the files.
Use kubectl wait and jsonpath to pass info from previous pods to the next once they are in the ready / RUNNING state.
Find the detailed docs here.

You can have a look at Kubernetes Jobs. It might do the trick, although Kubernetes was not really made for running scripts in sequence and sharing dependencies between containers.
This thread is similar to what you need. One tool mentioned that intrigued me was brigade.
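For the purely local case described in the question, the three steps can also be sketched without any orchestrator by calling docker run from a small Python script. This is only a sketch under some assumptions: the first container writes its result to stdout, the second accepts that result as a command-line argument, and the image names in the usage note are hypothetical.

```python
import subprocess


def run_container(image, args=(), runner=subprocess.run):
    """Run a container to completion and return its stdout as text."""
    cmd = ["docker", "run", "--rm", image, *args]
    # check=True raises if the container exits with a non-zero status,
    # so the second container is never started after a failed first step.
    result = runner(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def run_chain(first_image, second_image, runner=subprocess.run):
    """Run the first container, then feed its output to the second one."""
    output = run_container(first_image, runner=runner)
    return run_container(second_image, [output], runner=runner)
```

Calling run_chain("producer:latest", "consumer:latest") would run the two images in sequence. The runner parameter exists only so the chaining logic can be exercised without a Docker daemon.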

Run multiple Python scripts in Azure (using Docker?)

I have a Python script that consumes an Azure queue, and I would like to scale this easily inside Azure infrastructure. I'm looking for the easiest solution possible to
run the Python script in an environment that is as managed as possible
have a centralized way to see the scripts running and their output, and easily scale the amount of scripts running through a GUI or something very easy to use
I'm looking at Docker at the moment, but this seems very complicated for the extremely simple task I'm trying to achieve. What possible approaches are known to do this? An added bonus would be if I could scale wrt the amount of items on the queue, but it is fine if we'd just be able to manually control the amount of parallelism.

You should have a look at Azure Web Apps, which also support Python.
This would be a managed and scalable environment that also supports background tasks (WebJobs) with central logging.
Azure Web Apps also offer a free plan for development and testing.

In my experience, CoreOS on Azure can satisfy your needs. You can refer to the doc https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-coreos-how-to/ to get started.
CoreOS is a Linux distribution for running Docker containers, which you can access remotely via an SSH client such as PuTTY. To use Docker, you can search for a Docker tutorial via Bing to quickly learn the basic usage, which is enough for running Python scripts.

Sounds to me like you are describing something like a micro-services architecture. From that perspective, Docker is a great choice. I recommend you consider using an orchestration framework such as Apache Mesos or Docker Swarm which will allow you to run your containers on a cluster of VMs with the ability to easily scale, deploy new versions, rollback and implement load balancing. The schedulers Mesos supports (Marathon and Chronos) also have a Web UI. I believe you can also implement some kind of triggered scaling like you describe but that will probably not be off the shelf.
This does seem like a bit of a learning curve, but I think it is worth it, especially once you start considering the complexities of deploying new versions (with possible rollbacks), monitoring failures and even integrating things like Jenkins and continuous delivery.
For Azure, an easy way to deploy and configure a Mesos or Swarm cluster is by using Azure Container Service (ACS) which does all the hard work of configuring the cluster for you. Find additional info here: https://azure.microsoft.com/en-us/documentation/articles/container-service-intro/
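Whichever platform ends up running the containers, the queue-based scaling asked about in the question comes down to a small rule that maps the queue length to a number of script instances. A minimal sketch of such a rule follows; the function name, the items-per-worker ratio and the bounds are all hypothetical, and reading the queue length and applying the result are left to whatever tooling you choose.

```python
import math


def desired_workers(queue_length, items_per_worker=10, min_workers=1, max_workers=20):
    """Return how many script instances to run for a given queue length."""
    wanted = math.ceil(queue_length / items_per_worker)
    # Clamp to the configured bounds so the cluster never scales to zero
    # or beyond what you are willing to pay for.
    return max(min_workers, min(max_workers, wanted))
```

For example, with the default settings 35 queued items map to 4 workers, an empty queue keeps 1 worker alive, and a very long queue is capped at 20.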
