There's a similar question here, but it is from 2018 and the solution requires changing the base image for the workers. Another suggestion is to ssh into each node and apt-get install there. This doesn't seem useful, because when autoscaling spawns new nodes you'd need to do it again and again.
Anyway, is there a reasonable way to upgrade the base gcloud in late 2020?
Because task instances run in a shared execution environment, it is generally recommended to avoid using the gcloud CLI within Composer Airflow tasks where possible, to prevent state or version conflicts. For example, if multiple users share the same Cloud Composer environment and any of them changes the active credentials used by gcloud, they can unknowingly break the others' workflows.
Instead, consider using the Cloud SDK Python libraries to do what you need to do programmatically, or use the airflow.providers.google.cloud operators, which may already have what you need.
If you really need to use the gcloud CLI and don't share the environment, then you can use a BashOperator with an install/upgrade script to create a prerequisite for any tasks that need to use the CLI. Alternatively, you can build a custom Docker image with gcloud installed, and use GKEPodOperator or KubernetesPodOperator to run the CLI command in a Kubernetes pod. That would be slower, but more reliable than verifying dependencies each time.
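As a rough illustration of the prerequisite idea above, the upgrade step could first check which version is installed and skip the work when it is already recent enough. This is only a sketch: it assumes the first line of `gcloud version` output looks like `Google Cloud SDK 319.0.0`, and the function names and the minimum version are hypothetical.

```python
import re
import shutil
import subprocess


def parse_sdk_version(version_output):
    """Extract (major, minor, patch) from `gcloud version` output, or None."""
    match = re.search(r"Google Cloud SDK (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version_output, minimum):
    """Return True when gcloud is missing/unparseable or older than `minimum`."""
    installed = parse_sdk_version(version_output)
    return installed is None or installed < minimum


def installed_version_output():
    """Run `gcloud version`; return an empty string when the CLI is not on PATH."""
    if shutil.which("gcloud") is None:
        return ""
    result = subprocess.run(
        ["gcloud", "version"], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    # (319, 0, 0) is an arbitrary example threshold, not a recommendation.
    if needs_upgrade(installed_version_output(), minimum=(319, 0, 0)):
        print("gcloud should be installed or upgraded before dependent tasks run")
    else:
        print("gcloud is recent enough")
```

A check like this could run inside the prerequisite task, so that downstream tasks only start once the CLI is known to be present and new enough.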
I am trying to wrap my head around how DAGs should actually be executed in a Docker Compose environment when they depend on another service (with a separate Python venv) defined in the compose file.
I have set up Airflow via Docker Compose as described in the official documentation. I have also added a Django service, which has its own dependencies.
Now I would like to have a DAG that executes a Python script using that Django service's Python environment. (It also uses Django's models; not sure if that's relevant.)
The only way I see it working is with DockerOperator, as described here. I managed to set up and execute the test DAG mentioned there; however, when I try to run the real task, it fails due to networking issues. I am quite confident I can solve that issue, but setting everything up this way just seems like way too much hassle.
So, in the end, I guess I am wondering what the ideal architecture should be when using Airflow via Compose. Should the base Airflow image be extended with my Django service (creating one hell of a big image), or is there a better way?
You can use PythonVirtualenvOperator (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator), but it will recreate the Django virtualenv every time the task is run, so it is not ideal.
Another option is to run DockerOperator or KubernetesPodOperator (if you use Kubernetes) and have a separate image with Django installed (or even a base Django image).
Adding Django to Airflow is probably not the best idea: Airflow has ~500 dependencies when installed with all providers, so chances are that you will have some difficult-to-resolve conflicts.
Also, one of the things we are considering for Airflow 2.2 and beyond is a better way of handling caching, which could help with building a cacheable virtualenv that is created once and shared between workers/pods (but this is just in the discussion phase).
You can check out tomorrow's session at Airflow Summit, where we discuss what's coming (and generally Airflow Summit is cool):
https://airflowsummit.org/sessions/2021/looking-ahead-what-comes-after-airflow-2/
I'm using venv for my Python repo on GitHub, and want to run the same code on 10+ EC2 instances (each instance will have a cron job that just runs the same code on the same schedule).
Any recommendations on how to best achieve this + continue to make sure all instances get the latest release branches on github? I'd like to try and automate any configuration I need to do, so that I'm not doing this:
Create one ec2 instance, set up all the configurations I need, like download latest python version, etc. Then git clone, set up all the python packages I need using venv. Verify code works on this instance.
Repeat for remaining 10+ ec2 instances
Whenever someone releases a new master branch, I have to ssh into every EC2 instance, git pull to the correct branch, re-apply any new configurations I need, and repeat for all remaining 10+ EC2 instances.
Ideally I can just run some script that pushes everything that's needed to make the code work on all ec2 instances. I have little experience with this type of thing, but from reading around this is an approach I'm considering. Am I on the right track?:
Create a script I run to ssh into all my ec2 instances and git clone/update to correct branch
Use Docker to make sure all ec2 instances are set up properly so the python code works (Is this the right use-case for Docker?). Above script will run the necessary Docker commands
Similar thing with using venv and reading the requirements.txt file so all EC2 instances have the right Python packages and versions
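The first and last items in the list above could be sketched roughly as follows. This is not a recommended production setup, just an illustration of the "one script that updates every instance" idea; the host names, user, repository path and the `venv/` location inside the repository are all hypothetical, and it assumes key-based ssh access is already configured.

```python
import shlex
import subprocess

# Hypothetical values: replace with your own hosts, user and paths.
HOSTS = ["ec2-host-1.example.com", "ec2-host-2.example.com"]
REMOTE_USER = "ubuntu"
REPO_DIR = "/home/ubuntu/myproject"


def remote_update_script(repo_dir, branch):
    """Shell commands run on each instance: update the code, then the venv."""
    quoted_dir = shlex.quote(repo_dir)
    quoted_branch = shlex.quote(branch)
    return " && ".join([
        f"cd {quoted_dir}",
        "git fetch origin",
        f"git checkout {quoted_branch}",
        f"git pull origin {quoted_branch}",
        # Assumes a virtualenv already exists at <repo>/venv.
        "venv/bin/pip install -r requirements.txt",
    ])


def ssh_command(host, user, repo_dir, branch):
    """Build the argv list for one host (nothing is executed here)."""
    return ["ssh", f"{user}@{host}", remote_update_script(repo_dir, branch)]


def deploy(hosts, branch, dry_run=True):
    """Update every host in turn; return the hosts whose update failed."""
    failed = []
    for host in hosts:
        command = ssh_command(host, REMOTE_USER, REPO_DIR, branch)
        if dry_run:
            # Only show what would be run.
            print(" ".join(shlex.quote(part) for part in command))
            continue
        if subprocess.run(command, check=False).returncode != 0:
            failed.append(host)
    return failed


if __name__ == "__main__":
    deploy(HOSTS, "master", dry_run=True)
```

Running it with `dry_run=True` only prints the ssh commands, which makes it easy to review them before touching any instance.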
Depending on your app and requirements (is EC2 100% necessary?), I can recommend the following:
Capistrano-like SSH deployments (https://github.com/dlapiduz/fabistrano) if your fleet is static and you need fast deployments. Not a best practice and not terribly secure, but you mentioned a similar scheme in your post.
Using AWS Image Builder (https://aws.amazon.com/image-builder/) or Packer (https://www.packer.io/) to build a new release image and then replace the old image with the new one in your EC2 autoscaling group.
Build a Docker image of your app and use ECS or EKS to host it. I would recommend this approach if you're not married to running code directly on EC2 hosts.
I've got a little question about dependency management for packages used in Python operators.
We are using Airflow in an industrialized mode to run scheduled Python jobs. It works well, but we are having trouble dealing with the different Python libraries needed by each DAG.
Do you have any idea how to let developers install their own dependencies for their jobs without being admin, while being sure that these dependencies don't collide with other jobs?
Would you recommend having a bash task that loads a virtual env at the beginning of the job? Is there any official recommendation for doing it?
Thanks !
Romain.
In general I see two possible solutions for your problem:
Airflow has a PythonVirtualenvOperator, which allows a task to run in a virtualenv that gets created and destroyed automatically. You can pass a python_version and a list of requirements to the task to build the virtualenv.
Set up a docker registry and use a DockerOperator rather than a PythonOperator. This would allow teams to set up their own Docker images with specific requirements. This is how I think Heineken set up their airflow jobs as presented in their Airflow Meetup. I'm trying to see whether they posted their slides online but I can't seem to find them.
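To make the first option more concrete, the sketch below imitates the lifecycle described above using only the standard library: create a throwaway virtualenv, run code with its interpreter, and delete it afterwards. It is not Airflow's implementation and does not use the operator itself. The function name is hypothetical, and installing requirements is left as a comment because it needs pip and network access.

```python
import os
import subprocess
import tempfile
import venv


def run_in_temporary_venv(code):
    """Create a throwaway venv, run `code` with its interpreter, then clean up."""
    with tempfile.TemporaryDirectory() as env_dir:
        # with_pip=False keeps the sketch fast and offline. A real task would
        # create the env with pip and install its own requirements here.
        venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
        if os.name == "nt":
            python = os.path.join(env_dir, "Scripts", "python.exe")
        else:
            python = os.path.join(env_dir, "bin", "python")
        result = subprocess.run(
            [python, "-c", code], capture_output=True, text=True, check=True
        )
        # The directory, and with it the venv, is removed when the block exits.
        return result.stdout


if __name__ == "__main__":
    # Inside a venv, sys.prefix differs from sys.base_prefix.
    print(run_in_temporary_venv("import sys; print(sys.prefix != sys.base_prefix)"))
```

Because the environment is rebuilt on every call, each job gets isolated dependencies at the cost of extra startup time, which is the trade-off the operator makes as well.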
I have a Python project and I want to deploy it on an AWS EC2 instance. My project depends on other Python libraries and uses programs installed on my machine. What are the alternatives for deploying my project on an AWS EC2 instance?
Further details: My project consists of a Celery periodic task that uses ffmpeg and Blender to create short videos.
I have checked Elastic Beanstalk, but it seems to be tailored for web apps. I don't know if containerizing my project via Docker is a good idea...
The manual and cheapest way to do it would be:
1- Launch a spot instance
2- git clone the project
3- Install the libraries via pip
4- Install all dependent programs
5- Launch periodic task
I am looking for a more automatic way to do it.
Thanks.
Beanstalk is certainly an option. You don't necessarily have to use it for web apps and you can configure all of the dependencies needed via .ebextensions.
Containerization is usually my go-to strategy now. If you get it working within Docker locally, then you have several deployment options, and the whole thing gets much easier since you don't have to worry about setting up all the dependencies within the AWS instance.
Once you have it running in Docker you could use Beanstalk, ECS or CodeDeploy.
I have implemented a recommendation engine using Python 2.7 in Google Dataproc/Spark, and need to store the output as records in Datastore for subsequent use by App Engine APIs. However, there doesn't seem to be a way to do this directly.
There is no Python Datastore connector for Dataproc as far as I can see. The Python Dataflow SDK doesn't support writing to Datastore (although the Java one does). MapReduce doesn't have an output writer for Datastore.
That doesn't appear to leave many options. At the moment I think I will have to write the records to Google Cloud Storage and have a separate task running in App Engine to harvest them and store them in Datastore. That is not ideal: aligning the two processes has its own difficulties.
Is there a better way to get the data from Dataproc into Datastore?
I succeeded in saving Datastore records from Dataproc. This involved installing additional components on the master VM (ssh from the console).
The App Engine SDK is installed and initialised using
sudo apt-get install google-cloud-sdk-app-engine-python
sudo gcloud init
This places a new google directory under /usr/lib/google-cloud-sdk/platform/google_appengine/.
The datastore library is then installed via
sudo apt-get install python-dev
sudo apt-get install python-pip
sudo pip install -t /usr/lib/google-cloud-sdk/platform/google_appengine/ google-cloud-datastore
For reasons I have yet to understand, this actually installed at one level lower, i.e. in /usr/lib/google-cloud-sdk/platform/google_appengine/google/google, so for my purposes it was necessary to manually move the components up one level in the path.
To enable the interpreter to find this code I had to add /usr/lib/google-cloud-sdk/platform/google_appengine/ to the module search path. The usual Bash tricks did not persist, so I ended up doing this at the start of my recommendation engine.
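The answer does not show how the path was added, but a minimal sketch of doing it at the top of a script might look like this. The helper name is hypothetical; the directory is the one given above.

```python
import sys

# Path taken from the steps above; adjust if the SDK lives elsewhere.
APPENGINE_SDK_PATH = "/usr/lib/google-cloud-sdk/platform/google_appengine/"


def add_to_module_search_path(path):
    """Put `path` at the front of sys.path once, so imports can find it."""
    if path not in sys.path:
        sys.path.insert(0, path)


add_to_module_search_path(APPENGINE_SDK_PATH)

# Imports that depend on the path must come after the call above, e.g.:
# from google.cloud import datastore
```

Changing `sys.path` inside the program only affects that process, which avoids relying on shell environment settings being picked up by the job.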
Because of the large amount of data to be stored, I also spent a lot of time attempting to save it via MapReduce. Ultimately I came to the conclusion that too many of the required services were missing on Dataproc. Instead I am using a multiprocessing pool, which is achieving acceptable performance.
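The multiprocessing approach is not shown in the answer; one possible shape, written in Python 3 syntax, is sketched below. `save_batch` is a hypothetical stand-in for the real Datastore write, and the batch size and process count are assumptions to be tuned.

```python
from multiprocessing import Pool

BATCH_SIZE = 500  # assumed batch size; tune to the write limits you hit


def chunked(records, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def save_batch(batch):
    """Hypothetical stand-in for the real write, e.g. a Datastore batch put.

    Returns the number of records it was given so the caller can total them.
    """
    return len(batch)


def save_all(records, processes=4):
    """Write all batches in parallel worker processes; return the total saved."""
    with Pool(processes=processes) as pool:
        return sum(pool.map(save_batch, chunked(records, BATCH_SIZE)))


if __name__ == "__main__":
    fake_records = [{"id": i} for i in range(1234)]
    print(save_all(fake_records))
```

Batching keeps the number of round trips down, while the pool lets several batches be written at the same time. In a real job, each worker process would need to create its own Datastore client.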
In the past, the Cloud Dataproc team maintained a Datastore Connector for Hadoop but it was deprecated for a number of reasons. At present, there are no formal plans to restart development of it.
The page mentioned above has a few options and your approach is one of the solutions mentioned. At this point, I think your setup is probably one of the easiest ones if you're committed to Cloud Datastore.