I want to use AWS Fargate as a data preprocessor in an ML pipeline.
I deployed a Docker image containing my Python script to AWS ECR.
I also created a task with this image.
My questions are:
Which cluster should I use? I don't fully understand the concept of a cluster.
How do I deploy this in the pipeline? (Ideally it would be triggered by an S3 event and executed like a docker run.)
Thank you for your answers.
The cluster type to use is "Networking only":
With this option, you can launch a cluster with a new VPC to use for Fargate tasks.
A CI/CD pipeline for automated deployment of your Docker images to your ECS service can be built with CodePipeline. A good starting point is the following tutorial:
Tutorial: Continuous Deployment with CodePipeline for ECS
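If you want the "triggered by an S3 event and executed like a docker run" behaviour rather than a full CI/CD pipeline, a common pattern is an S3 event notification that invokes a Lambda function, which in turn calls run_task against your Fargate cluster. Here is a minimal sketch with boto3; the cluster, task definition, subnet and container names are placeholders for your own resources:

# Hypothetical Lambda handler: an S3 "ObjectCreated" event starts a Fargate task.
# Cluster, task definition, subnet and container names below are placeholders.
import boto3

ecs = boto3.client("ecs")

def lambda_handler(event, context):
    # Pull the bucket/key of the uploaded object out of the S3 event record
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Run the already-registered task definition on Fargate (the "docker run" equivalent)
    ecs.run_task(
        cluster="preprocessing-cluster",          # the "Networking only" cluster
        launchType="FARGATE",
        taskDefinition="preprocessor-task",       # task definition built from your ECR image
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [{
                "name": "preprocessor",           # container name inside the task definition
                "environment": [
                    {"name": "S3_BUCKET", "value": bucket},
                    {"name": "S3_KEY", "value": key},
                ],
            }]
        },
    )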
tl;dr
An Apache Beam pipeline step involves building a Docker image; how can I run this pipeline using Google Dataflow? What alternatives exist?
I'm currently trying to take my first steps with Google's Dataflow service and Apache Beam (Python).
Trivial examples are pretty straightforward, but things get confusing as soon as external software dependencies come into play. It seems to be possible to use custom Docker containers to set up one's own environment [1][2]. While that's great for most dependencies, it doesn't help if the dependency is Docker itself, as happens to be the case for me:
One step of my pipeline involves using an external project which makes heavy use of Docker (i.e. building images and running them).
As far as I can tell there are three options to tackle that problem:
Docker within Docker
Run the external project's scripts, which build Docker images, inside a Docker container running on a Dataflow worker node. While building a Docker image within Docker is possible in principle [3], I have a feeling that it won't work in this case, since there is only very limited control over the environment.
Custom VM image for worker nodes
Is it possible to use custom VM images for Dataflow worker nodes?
Don't use Google Dataflow
What alternative services would be better suited?
Thanks!
[1] Custom VM images for Google Cloud Dataflow workers
[2] https://cloud.google.com/dataflow/docs/guides/using-custom-containers
[3] https://www.docker.com/blog/docker-can-now-run-within-docker/
Custom VM image for worker nodes: Is it possible to use custom VM images for Dataflow worker nodes?
It's not possible to completely replace the Dataflow worker, but you can use a custom Beam SDK Docker container, as you noted. In your case this will result in a Docker-in-Docker type of execution.
Don't use Google Dataflow: What alternative services would be better suited?
Please see here for other Beam runners and their capabilities.
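For the custom Beam SDK container route ([2] above), the image is handed to Dataflow through pipeline options. Here is a minimal sketch, assuming a recent Beam SDK that supports the sdk_container_image option; the project, region, bucket and image names are placeholders:

# Minimal sketch of launching a Beam pipeline on Dataflow with a custom SDK
# container ([2] above). Project, region, bucket and image names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    sdk_container_image="gcr.io/my-gcp-project/my-beam-worker:latest",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create([1, 2, 3])
     | "Process" >> beam.Map(lambda x: x * 2)  # the step with external dependencies would go here
     | "Log" >> beam.Map(print))               # output ends up in the worker logs on Dataflow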
I'm looking to create a publisher that streams and sends tweets containing a certain hashtag to a pub/sub topic.
The tweets will then be ingested with cloud dataflow and then loaded into a Big Query database.
In the following article they do something similar, where the publisher is hosted in a Docker image on a Google Compute Engine instance.
Can anyone recommend alternative Google Cloud resources that could host the publisher code more simply and avoid the need to create a Dockerfile, etc.?
The publisher would need to run constantly. Would Cloud Run, for example, be a suitable alternative?
There are some workarounds I can think of:
A quick way to avoid a container architecture is to keep the on_data method inside a loop, for example with something like while(True), or to start a Stream as explained in Create your Python script, and then run the code on a Compute Engine instance in the background with nohup python -u myscript.py. Alternatively, follow the steps described in Script on GCE to capture tweets, which uses tweepy.Stream to start the streaming (a minimal sketch of this approach follows after this list).
You might want to reconsider the Dockerfile option, since its configuration may not be so difficult. See Tweets & pipelines, where there is a script that reads the data and publishes to Pub/Sub; only 9 lines are used for the Dockerfile, and it is deployed on App Engine using Cloud Build. Another implementation with a Dockerfile that requires more steps is twitter-for-bigquery, in case it helps; there you will see more specific steps and more configuration.
Cloud Functions is another option; in the guide Serverless Twitter with Google Cloud you can check the Design section to see if it fits your use case.
Airflow with Twitter Scraper could work for you, since Cloud Composer is a managed service for Airflow and lets you create an Airflow environment quickly. It uses the Twint library; check the Technical section in the link for more details.
Stream Twitter Data into BigQuery with Cloud Dataprep is a workaround that avoids complex configuration. In this case the job won't run constantly, but it can be scheduled to run every few minutes.
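As a rough illustration of the first workaround, here is a minimal sketch that combines tweepy's streaming API with the Pub/Sub publisher client. The project ID, topic name, hashtag and Twitter credentials are placeholders, and the exact streaming classes depend on your tweepy version (a v4-style Stream subclass is assumed here):

# Minimal sketch: stream tweets containing a hashtag and publish them to Pub/Sub.
# Project, topic, hashtag and Twitter credentials are placeholders (tweepy v4-style API).
import tweepy
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "tweets")

class PubSubStream(tweepy.Stream):
    def on_data(self, raw_data):
        # Forward the raw tweet JSON to the Pub/Sub topic (payload must be bytes)
        data = raw_data if isinstance(raw_data, bytes) else raw_data.encode("utf-8")
        publisher.publish(topic_path, data=data)
        return True

stream = PubSubStream(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"
)
stream.filter(track=["#myhashtag"])  # blocks and keeps streaming until interrupted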
I have a python script that looks like this:
def process_number(num):
    # Do some processing using this number
    print(num)
I want to spin up a Kubernetes cluster, pass in a range of numbers, and have this script run in parallel across many machines. The number range can be hardcoded. I am unsure how to set up the Dockerfile for this application and how to deploy it to Kubernetes, as most of the examples I can find are for web apps.
You need a base image that has Python. You'll also need to define more of the script so that this function runs over and over again; otherwise the container will start and exit very quickly, which leads to CrashLoopBackOff.
Before worrying about how to deploy this in Kubernetes, make sure you have a working script and that you can containerize it.
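As a rough illustration, here is a minimal sketch of how the script could be rounded out so that each container instance works through its own slice of the hardcoded range. The environment variable names are assumptions: Kubernetes indexed Jobs expose JOB_COMPLETION_INDEX automatically, and TOTAL_WORKERS is something you would set yourself in the manifest:

# Minimal sketch: each container processes its own slice of a hardcoded range,
# based on an index passed in through the environment (JOB_COMPLETION_INDEX is
# set automatically for indexed Kubernetes Jobs; TOTAL_WORKERS is assumed to be
# set in your manifest).
import os

def process_number(num):
    # Do some processing using this number
    print(num)

if __name__ == "__main__":
    worker_index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    total_workers = int(os.environ.get("TOTAL_WORKERS", "1"))

    # Hardcoded overall range, split across workers by striding over the index
    for num in range(1, 1001):
        if num % total_workers == worker_index:
            process_number(num)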
I recommend you containerize your application first and make sure it is running well before moving to Kubernetes.
You can see some examples of how to containerize it at these links:
Dockerize your Python Application
Containerize a Python App in 5 Minutes
After ensuring your application runs as expected in a container, you can move on to the next step, which is deploying it in Kubernetes.
I strongly recommend you start with some Kubernetes tutorials to understand how the components work and how you can deploy your application. You can use these links:
Local Development Environment for Kubernetes using Minikube
Getting Started With Kubernetes #1
Getting Started with Kubernetes #2
Kubernetes: An Introduction
kubernetes the hard way
k8s-intro-tutorials
If you don't have much time, or you just need something that works for your case, take a look at Kubernetes Deployments; they let you run your containerized application across multiple replicas.
To expose your service (internally and externally) you can use a Kubernetes Service.
To test locally, you can install Minikube on your machine.
Hope that helps!
This question is related to understanding a concept regarding the DockerOperator and Apache Airflow, so I am not sure if this site is the correct place. If not, please let me know where I can post it.
The situation is the following: I am working on a Windows laptop, and I have developed a very basic ETL pipeline that extracts data from a server and writes the unprocessed data into a MongoDB on a scheduled basis with Apache Airflow. I have a docker-compose.yml file with four services: a mongo service for the MongoDB, a mongo-express service as an admin tool for the MongoDB, a webserver service for Apache Airflow, and a postgres service as the database backend for Apache Airflow.
So far, I have developed some Python code in functions and these functions are being called by the Airflow instance using the PythonOperator. Since debugging is very difficult using the PythonOperator, I want to try the DockerOperator now instead. I have been following this tutorial which claims that using the DockerOperator, you can develop your source code independent of the operating system the code will later be executed on due to Docker's concept 'build once, run everywhere'.
My problem is that I didn't fully understand all necessary steps needed to run code using the DockerOperator. In the tutorial, I have the following questions regarding the Task Development and Deployment:
Package the artifacts together with all dependencies into a Docker image. ==> Does this mean that I have to create a Dockerfile for every task and then build an image using this Dockerfile?
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
Thanks for your time, I highly appreciate it!
Typically you're going to have a Docker image that handles one type of task, so for any one pipeline you'd probably be using a variety of different Docker images, a different one for each step.
There are a couple of considerations here in regards to your question which is specifically around deployment.
You'll need to create a Docker image. You likely want to add a tag to this as you will want to version the image. The DockerOperator defaults to the latest tag on an image.
The image needs to be available to your deployed instance of Airflow. They can be built on the machine you're running Airflow on if you're wanting to run it locally. If you've deployed Airflow somewhere online, the more common practice would be to push them to a cloud service. There are a number of providers you can use (Docker Hub, Amazon ECR, etc...).
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
Once your image is built and available to Airflow, you simply need to create a task using the DockerOperator, like so:
from airflow import DAG
# In older Airflow versions the import is airflow.operators.docker_operator
from airflow.providers.docker.operators.docker import DockerOperator

dag = DAG(**kwargs)  # dag_id, start_date, schedule_interval, etc.

task_1 = DockerOperator(
    dag=dag,
    task_id='docker_task',
    image='dummyorg/dummy_api_tools:v1',
    auto_remove=True,
    docker_url='unix://var/run/docker.sock',
    command='python extract_from_api_or_something.py',
)
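As for the "expose an entrypoint" part: the command above just invokes a script that is baked into the image, so parameterizing a task usually comes down to that script accepting command-line arguments or environment variables. Below is a minimal sketch of such an entrypoint script; the file name extract_from_api_or_something.py comes from the example above, and the --date flag is made up for illustration:

# Hypothetical entrypoint script baked into the image; the --date flag is made up.
# The DockerOperator's command can pass arguments to parameterize a run, e.g.
#   command='python extract_from_api_or_something.py --date {{ ds }}'
import argparse

def main():
    parser = argparse.ArgumentParser(description="Extract data for one pipeline run")
    parser.add_argument("--date", required=False, help="Logical date passed in by Airflow")
    args = parser.parse_args()

    # ... do the actual extraction work here ...
    print(f"Extracting data for date={args.date}")

if __name__ == "__main__":
    main()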
I'd recommend investing some time into understanding Docker. It's a little bit difficult to wrap your head around at first but it's a highly valuable tool, especially for systems like Airflow.
I'm creating microservices using Docker Containers.
Initially, I run one Docker container and it provides me with some output that I need as input for a second docker container.
So the flow of steps would be:
Run Docker container;
Get output;
Trigger running of second Docker container with previous output.
I have looked into Kubernetes, cloud functions and pub/sub on Google Cloud. Though I would like to run this locally first.
Unfortunately, I haven't found a solution; my processes are more like scripts than web-based applications.
If you are asking from an auto-devops point of view, you can try utilizing the kubectl wait command for this.
Prepare YAML files for each of the containers (e.g. Pod, Deployment or StatefulSet).
Write a shell script that runs kubectl apply -f on the files.
Use kubectl wait and jsonpath to pass info from the previous pod to the next once it is in the Ready/Running state.
Find the detailed docs here.
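Since your processes are scripts rather than web apps, the same apply/wait/extract flow can also be driven from Python with subprocess instead of a pure shell script. A rough sketch; the manifest file names, the pod name and the jsonpath expression are placeholders:

# Rough sketch of the apply -> wait -> extract flow driven from Python.
# Manifest file names, pod names and the jsonpath expression are placeholders.
import subprocess

def kubectl(*args):
    # Run a kubectl command and return its stdout as a stripped string
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

# 1. Create the first container from its manifest
kubectl("apply", "-f", "first-step.yaml")

# 2. Block until it is Ready (for a Job you could wait on condition=complete instead)
kubectl("wait", "--for=condition=Ready", "pod/first-step", "--timeout=300s")

# 3. Pull some piece of its state out with jsonpath
output = kubectl("get", "pod", "first-step", "-o", "jsonpath={.status.podIP}")

# 4. The extracted value can now be templated into the second manifest before
#    applying it (shown here only by printing it)
print(f"value to hand to the second container: {output}")
kubectl("apply", "-f", "second-step.yaml")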
You can have a look at Kubernetes Jobs. It might do the trick, although Kubernetes was not really made for running scripts in sequence and sharing dependencies between containers.
This thread is similar to what you need. One tool mentioned there that intrigued me is Brigade.