Parallelize a simple Python script with Kubernetes

I have a Python script that looks like this:
def process_number(num):
    # Do some processing using this number
    print(num)
I want to spin up a Kubernetes cluster and pass in a range of numbers, and have this script run in parallel across many machines. The number range can be hardcoded. I am unsure how to set up the Dockerfile for this application, and how to deploy it to Kubernetes, as most of the examples I can find are for web apps.

You need a base image that has Python. You'll also need to define more of the script to ensure that this function runs over and over again; otherwise the container will start and exit very quickly, which leads to CrashLoopBackOff.
Before worrying about how to deploy this to Kubernetes, you need to make sure you have a working script and that you can containerize it.
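For example, here is a minimal sketch of what that script could look like, reading its slice of the range from environment variables (the RANGE_START/RANGE_END names are placeholders you would set on each pod):
import os
import time

def process_number(num):
    # Do some processing using this number
    print(num, flush=True)

if __name__ == "__main__":
    # Hypothetical env vars; give each pod a different slice of the range.
    start = int(os.environ.get("RANGE_START", "0"))
    end = int(os.environ.get("RANGE_END", "1000"))
    for num in range(start, end):
        process_number(num)
    # If this runs under a Deployment rather than a Job, keep the process
    # alive once the work is done so Kubernetes does not keep restarting it.
    while True:
        time.sleep(3600)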

I recommend you containerize your application first and make sure it runs well before moving to Kubernetes.
You can see some examples of how to containerize it in these links:
Dockerize your Python Application
Containerize a Python App in 5 Minutes
After ensuring your application runs as expected in a container, you can move to the next step: deploying it to Kubernetes.
I strongly recommend you start with some Kubernetes tutorials to understand how the components work and how you can deploy your application. You can use these links:
Local Development Environment for Kubernetes using Minikube
Getting Started With Kubernetes #1
Getting Started with Kubernetes #2
Kubernetes: An Introduction
kubernetes the hard way
k8s-intro-tutorials
If you don't have much time, or you just need something that works for your case, take a look at a Kubernetes Deployment; you can run your containerized application across multiple replicas.
To expose your application (internally and externally) you can use a Kubernetes Service.
To test locally you can use a minikube installation on your machine.
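If you prefer to create that Deployment from Python instead of writing YAML by hand, here is a minimal sketch using the official kubernetes client library (the image name, env values, and replica count are placeholders; for a finite batch of numbers a Kubernetes Job with parallelism is also worth considering):
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig, e.g. from minikube

container = client.V1Container(
    name="number-worker",
    image="yourrepo/number-worker:latest",  # placeholder image name
    env=[
        client.V1EnvVar(name="RANGE_START", value="0"),
        client.V1EnvVar(name="RANGE_END", value="1000"),
    ],
)
template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "number-worker"}),
    spec=client.V1PodSpec(containers=[container]),
)
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="number-worker"),
    spec=client.V1DeploymentSpec(
        replicas=4,  # placeholder replica count
        selector=client.V1LabelSelector(match_labels={"app": "number-worker"}),
        template=template,
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)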
Hope that helps!

Related

How to use the DockerOperator from Apache Airflow

This question is related to understanding a concept regarding the DockerOperator and Apache Airflow, so I am not sure if this site is the correct place. If not, please let me know where I can post it.
The situation is the following: I am working with a Windows laptop, and I have developed a very basic ETL pipeline that extracts data from some server and writes the unprocessed data into a MongoDB on a scheduled basis with Apache Airflow. I have a docker-compose.yml file with four services: a mongo service for the MongoDB, a mongo-express service as an admin tool for the MongoDB, a webserver service for Apache Airflow, and a postgres service as the database backend for Apache Airflow.
So far, I have developed some Python code in functions and these functions are being called by the Airflow instance using the PythonOperator. Since debugging is very difficult using the PythonOperator, I want to try the DockerOperator now instead. I have been following this tutorial which claims that using the DockerOperator, you can develop your source code independent of the operating system the code will later be executed on due to Docker's concept 'build once, run everywhere'.
My problem is that I didn't fully understand all necessary steps needed to run code using the DockerOperator. In the tutorial, I have the following questions regarding the Task Development and Deployment:
Package the artifacts together with all dependencies into a Docker image. ==> Does this mean that I have to create a Dockerfile for every task and then build an image using this Dockerfile?
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
Thanks for your time, I highly appreciate it!
Typically you're going to have a Docker image that handles one type of task, so for any one pipeline you'd probably be using a variety of different Docker images, one for each step.
There are a couple of considerations here with regard to your question, which is specifically about deployment.
You'll need to create a Docker image. You likely want to add a tag to it, as you will want to version the image; the DockerOperator defaults to the latest tag on an image.
The image needs to be available to your deployed instance of Airflow. It can be built on the machine you're running Airflow on if you want to run it locally. If you've deployed Airflow somewhere online, the more common practice would be to push it to a registry; there are a number of providers you can use (Docker Hub, Amazon ECR, etc.).
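For example, here is a minimal sketch using the Docker SDK for Python to build, tag, and push such an image (the repository name matches the example below and is a placeholder; plain docker build/tag/push on the command line works just as well):
import docker

client = docker.from_env()

# Build the image from the Dockerfile in the current directory and tag it
# with an explicit version so the DockerOperator doesn't fall back to :latest.
image, build_logs = client.images.build(path=".", tag="dummyorg/dummy_api_tools:v1")

# Push to a registry your Airflow deployment can pull from
# (assumes you are already logged in, e.g. via `docker login`).
push_output = client.images.push("dummyorg/dummy_api_tools", tag="v1")
print(push_output)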
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
Once your image is built and available to Airflow, you simply need to create a task using the DockerOperator, like so:
from airflow import DAG
# On Airflow 2 this import requires the Docker provider package;
# on Airflow 1.10 use `from airflow.operators.docker_operator import DockerOperator`.
from airflow.providers.docker.operators.docker import DockerOperator

dag = DAG(**kwargs)  # your usual DAG arguments

task_1 = DockerOperator(
    dag=dag,
    task_id='docker_task',
    image='dummyorg/dummy_api_tools:v1',
    auto_remove=True,
    docker_url='unix://var/run/docker.sock',
    command='python extract_from_api_or_something.py'
)
I'd recommend investing some time into understanding Docker. It's a little bit difficult to wrap your head around at first but it's a highly valuable tool, especially for systems like Airflow.

Is there a way to schedule docker containers to run with respect to one another?

I'm creating microservices using Docker Containers.
Initially, I run one Docker container and it provides me with some output that I need as input for a second docker container.
So the flow of steps would be:
Run Docker container;
Get output;
Trigger running of second Docker container with previous output.
I have looked into Kubernetes, cloud functions and pub/sub on Google Cloud. Though I would like to run this locally first.
Unfortunately, I haven't found a solution, my processes are more like scripts than web-based applications.
If you are asking from an auto-devops point of view, you can try utilizing the kubectl wait command for this.
Prepare YAML files for each of the containers (e.g. Pod, Deployment, or StatefulSet).
Write a shell script that runs kubectl apply -f on the files.
Use kubectl wait and jsonpath to pass info from the previous pods to the next, once they are in the Ready/Running state.
Find the detailed docs here.
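Here is a minimal sketch of such a driver, written in Python with subprocess rather than as a shell script (the manifest file names and the job name are placeholders):
import subprocess

def kubectl(*args):
    # Run a kubectl command and return its stdout, failing loudly on errors.
    return subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    ).stdout

# Start the first workload and wait for it to finish.
kubectl("apply", "-f", "step-one.yaml")
kubectl("wait", "--for=condition=complete", "job/step-one", "--timeout=600s")

# Read something from the finished workload with jsonpath, then start step two.
pod_name = kubectl(
    "get", "pods", "-l", "job-name=step-one",
    "-o", "jsonpath={.items[0].metadata.name}",
)
print(f"step one ran in pod {pod_name}; now applying step-two.yaml")
kubectl("apply", "-f", "step-two.yaml")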
You can have a look at Kubernetes Jobs. It might do the trick, although Kubernetes was not really made for running scripts in sequence and sharing dependencies between containers.
This thread is similar to what you need. One tool mentioned that intrigued me was brigade.
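Since you mentioned wanting to run this locally first, here is a minimal sketch using the Docker SDK for Python to chain two containers, feeding the first container's output to the second (the image names are placeholders):
import docker

client = docker.from_env()

# Run the first container to completion and capture its stdout (returned as bytes).
output = client.containers.run("step-one-image", remove=True)

# Pass the output to the second container, e.g. via an environment variable.
client.containers.run(
    "step-two-image",
    environment={"STEP_ONE_OUTPUT": output.decode().strip()},
    remove=True,
)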

Kubernetes OpenShift for Python

I am new to OpenShift. We are trying to deploy a Python module in a pod that is accessed by other Python code running in different pods. When I deploy it, the pod runs and immediately crashes with status "CrashLoopBackOff". This Python code is an independent module that does not have a valid entrypoint, so how do you deploy this type of Python module in OpenShift? I'd appreciate any solutions.
You don't. You deploy something that can run as a process and, as such, has some way of interacting with the external world (i.e. listening for requests, connecting to a message broker, sending requests, reading/writing to a database, etc.). You do not package and deploy to the cluster libraries that are inoperable on their own.
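One common way to do that is to wrap the module in a small service with a real entrypoint. A minimal sketch, assuming your module is importable as mymodule and exposes a hypothetical function do_work(payload):
from flask import Flask, jsonify, request

import mymodule  # hypothetical: the library code the other pods want to call

app = Flask(__name__)

@app.route("/work", methods=["POST"])
def work():
    # Delegate the request body to the library function and return its result.
    result = mymodule.do_work(request.get_json())
    return jsonify(result=result)

if __name__ == "__main__":
    # Listen on 0.0.0.0 so the OpenShift service can route traffic to the pod.
    app.run(host="0.0.0.0", port=8080)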

Run multiple Python scripts in Azure (using Docker?)

I have a Python script that consumes an Azure queue, and I would like to scale this easily inside Azure infrastructure. I'm looking for the easiest solution possible to
run the Python script in an environment that is as managed as possible
have a centralized way to see the scripts running and their output, and easily scale the number of scripts running through a GUI or something very easy to use.
I'm looking at Docker at the moment, but this seems very complicated for the extremely simple task I'm trying to achieve. What possible approaches are known to do this? An added bonus would be if I could scale with respect to the number of items on the queue, but it is fine if we'd just be able to manually control the amount of parallelism.
You should have a look at Azure Web Apps, which also support Python.
This would be a managed and scalable environment that also supports background tasks (WebJobs) with central logging.
Azure Web Apps also offer a free plan for development and testing.
In my experience, I think CoreOS on Azure can satisfy your needs. You can refer to the doc https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-coreos-how-to/ to learn how to get started.
CoreOS is a Linux distribution built for running Linux containers with Docker, and you can access it remotely via an SSH client like PuTTY. For using Docker, you can search the keywords "Docker tutorial" via Bing to quickly learn the simple usage that is enough for running Python scripts.
Sounds to me like you are describing something like a micro-services architecture. From that perspective, Docker is a great choice. I recommend you consider using an orchestration framework such as Apache Mesos or Docker Swarm which will allow you to run your containers on a cluster of VMs with the ability to easily scale, deploy new versions, rollback and implement load balancing. The schedulers Mesos supports (Marathon and Chronos) also have a Web UI. I believe you can also implement some kind of triggered scaling like you describe but that will probably not be off the shelf.
This does seem like a bit of a learning curve, but I think it is worth it, especially once you start considering the complexities of deploying new versions (with possible rollbacks), monitoring failures, and even integrating things like Jenkins and continuous delivery.
For Azure, an easy way to deploy and configure a Mesos or Swarm cluster is by using Azure Container Service (ACS) which does all the hard work of configuring the cluster for you. Find additional info here: https://azure.microsoft.com/en-us/documentation/articles/container-service-intro/

Docker: is it possible to run containers from another container?

I am running a dockerized Python web application that has to run long tasks on certain requests (i.e. running some R scripts that take around 1 minute to complete). At the moment I put everything in one container and just run it like that.
However, I think it would be faster and cleaner to separate this 'background web app' and the R scripts (one process = one container). I was therefore wondering if there is a way to run a container from within another container (i.e. being able to call docker run [...] on the host from the already-dockerized web application).
I tried to search for it and found some useful information on linking containers together, but in my case I'd more be interested in being able to create single-use containers on the fly.
I quite like this solution: Run docker inside a docker container?, which basically allows you to use the Docker daemon that's running on the host.
But if you really want to run docker in docker, here is the official solution using the dind image: https://blog.docker.com/2013/09/docker-can-now-run-within-docker/
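With the host's /var/run/docker.sock mounted into the web-app container, here is a minimal sketch using the Docker SDK for Python to launch a single-use container for an R script (the image name and script path are placeholders):
import docker

# Talks to the host daemon through the mounted /var/run/docker.sock.
client = docker.from_env()

# Launch a throwaway container for one R job; it is removed when it exits.
logs = client.containers.run(
    "r-base",
    command=["Rscript", "/scripts/long_task.R"],
    volumes={"/path/on/host/scripts": {"bind": "/scripts", "mode": "ro"}},
    remove=True,
)
print(logs.decode())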
