I have a question about dependency management for packages used in Python operators.
We are using Airflow in an industrialized mode to run scheduled Python jobs. It works well, but we are struggling to manage the different Python libraries needed by each DAG.
Do you have any idea how to let developers install their own dependencies for their jobs without being admins, while making sure these dependencies don't collide with those of other jobs?
Would you recommend having a bash task that loads a virtual env at the beginning of the job? Is there an official recommendation for this?
Thanks!
Romain.
In general I see two possible solutions to your problem:
1) Airflow has a PythonVirtualenvOperator, which runs a task in a virtualenv that is created and destroyed automatically. You can pass a python_version and a list of requirements to the task to build the virtualenv.
2) Set up a Docker registry and use a DockerOperator rather than a PythonOperator. This would allow teams to build their own Docker images with specific requirements. I believe this is how Heineken set up their Airflow jobs, as presented at their Airflow Meetup. I tried to find out whether they posted their slides online, but I can't seem to find them.
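As a rough illustration of the first option, a task using PythonVirtualenvOperator might look like the sketch below. It assumes Airflow 2.x (where the operator lives in airflow.operators.python); the task id, the pinned requirement and the callable are hypothetical. The Airflow import is kept inside the builder function so the snippet can be loaded even where Airflow is not installed.

```python
# Hypothetical per-task dependency list; each task can pin its own versions.
REQUIREMENTS = ["requests==2.31.0"]


def report_requests_version():
    # The callable runs inside the freshly created virtualenv, so its
    # imports must live inside the function body.
    import requests

    return requests.__version__


def make_isolated_task(dag, python_version=None):
    """Build a task that runs report_requests_version in its own virtualenv."""
    # Assumes Airflow 2.x; imported lazily so this module loads without Airflow.
    from airflow.operators.python import PythonVirtualenvOperator

    return PythonVirtualenvOperator(
        task_id="report_requests_version",  # hypothetical task id
        python_callable=report_requests_version,
        requirements=REQUIREMENTS,
        python_version=python_version,  # None means the worker's own Python
        system_site_packages=False,  # do not expose the worker's packages
        dag=dag,
    )
```

Because the virtualenv is rebuilt on each run, this tends to suit small dependency sets; heavier environments are probably better served by the Docker option.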
So I'm trying to get IntelliJ to see the Apache Airflow that I downloaded. The steps I've taken so far:
I downloaded the most recent Apache Airflow release (2.2.3) and saved it onto my desktop. To get it to work with IntelliJ, I tried adding the Apache Airflow folder under both Libraries and Modules; both came back with errors stating it's not being utilized. I've looked in the Airflow documentation, but I can't find anything on how to set up your own IDE to write Python scripts for DAGs and other items.
How would I go about doing this? I'm at a complete loss as to how to get IntelliJ to recognize Apache Airflow as a library for Python code, so that I can write DAG files correctly within the IDE itself.
Any help would be much appreciated as I've been stuck on this aspect for the past couple of days searching for any kind of documentation to make this work.
Airflow is both an application and a library. In your case you are not trying to run the application, only to write DAGs, so you need it just as a library.
You should just open a virtual environment (preferably) and run:
pip install apache-airflow
Then you can write DAGs using the library, and IntelliJ will let you know if you are using wrong imports or deprecated objects.
When your DAG file is ready, deploy it to the DAG folder on the machine where Airflow is running.
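To confirm that the interpreter the IDE is pointed at can actually see the library, a small check like the one below may help. It is only a sketch using the standard library (Python 3.8+); the helper name installed_version is made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("apache-airflow")
    if version is None:
        print("apache-airflow is not installed for this interpreter")
    else:
        print(f"apache-airflow {version} is available")
```

Running it with the virtual environment's interpreter shows whether pip install went into the environment the IDE uses.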
I'm trying to set up an MWAA Airflow 2.0 environment that integrates S3 and GCP's Pub/Sub. While we have no problems with the environment being initialized, we're having trouble installing some dependencies and importing Python packages -- specifically apache-airflow-providers-google==2.2.0.
We've followed all of the instructions based on the official MWAA Python documentation. We already included the constraints file as prescribed by AWS, activated all Airflow logging configs, and tested the requirements.txt file using the MWAA local runner. The result when updating our MWAA environment's requirements would always be like this
When testing using the MWAA local runner, we observed that using the requirements.txt file with the constraints still takes forever to resolve. Installation takes more than 10-30 minutes which is no good.
As an experiment, we tried using a version of the requirements.txt file that omits the constraints and pinned versioning. Doing so installs the packages successfully and we don't receive import errors anymore on both MWAA local runner and our MWAA environment itself. However, all of our dags will fail to run no matter what. Airflow logs are also inaccessible whenever we do this.
The team and I have been trying to get MWAA environments up and running for our different applications and ETL pipelines but we just can't seem to get things to work smoothly. Any help would be appreciated!
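For reference, the constrained requirements.txt described above can be sketched as follows. The URL pattern is the one published for Airflow constraint files; the Airflow version 2.0.2 and Python version 3.7 are assumptions about the environment, and both helper names are hypothetical.

```python
def constraints_url(airflow_version, python_version):
    """Build the URL of the Airflow constraints file for the given versions."""
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


def render_requirements(airflow_version, python_version, packages):
    """Render requirements.txt text with the constraint line first."""
    url = constraints_url(airflow_version, python_version)
    lines = [f'--constraint "{url}"']
    lines.extend(packages)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed versions: Airflow 2.0.2 on Python 3.7.
    text = render_requirements(
        "2.0.2", "3.7", ["apache-airflow-providers-google==2.2.0"]
    )
    print(text, end="")
```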
I had the same problem, and in the end we had to refactor a lot of things to remove the dependency. It looks like a problem between the pip resolver and apache-airflow-providers-google; see the official page:
https://pypi.org/project/apache-airflow-providers-google/2.0.0rc1/
In the WORST case, you may need to run Airflow directly on EC2 from a Docker image and abandon MWAA :(
I've been through similar issues, but with different packages. There are certain things you need to take into consideration when using MWAA. I didn't have any issues testing the packages on the local runner and then on MWAA using a public VPC. I only had issues when using a private VPC, as the web server doesn't have an internet connection, so the method for getting packages into MWAA is different.
Things to take into consideration:
- The version of the packages: test on the local runner first if you can.
- Enable the logs: the scheduler and web server logs can show you issues, but they may not. This is because Fargate, which serves the images, will try to roll back to a working state rather than leave MWAA in a non-working state. So you might not see what the error actually is; in certain scenarios it may even look like there were no errors.
- Check dependencies: you may need to download a package with pip download <package>==<version>. You can then inspect the contents of the .whl file and see whether it has any dependencies. There may also be extra notes that point you in the right direction. In one case, the Slack package wouldn't work until I also added the http package, even though Airflow includes this package.
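The dependency check in the last item can be scripted. A wheel is a zip archive whose METADATA file lists its dependencies as Requires-Dist headers, so a minimal sketch with only the standard library could look like this (the function name wheel_requirements is made up for illustration):

```python
import zipfile


def wheel_requirements(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # A wheel stores its metadata at <name>-<version>.dist-info/METADATA.
        metadata_names = [
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        ]
        if not metadata_names:
            return []
        text = wheel.read(metadata_names[0]).decode("utf-8")

    requirements = []
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the headers; the description follows
        if line.startswith("Requires-Dist:"):
            requirements.append(line.split(":", 1)[1].strip())
    return requirements
```

Calling it on a file fetched with pip download returns the declared requirements as strings, which can then be compared against what is already in requirements.txt.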
So, yes it's serverless, and you may have an easy time installing/setting MWAA up, but be prepared to do a little investigation if it doesn't work. I did contact AWS support, but managed to solve it myself in the end. Other than trying the obvious things, only those that use MWAA frequently and have faced varying scenarios will be of any assistance.
I am trying to wrap my head around how DAGs should actually be executed in a Docker Compose environment when they depend on another service (with a separate Python venv) defined in the compose file.
I have set up Airflow via Docker Compose as described in the official documentation. I have also added a Django service, which has its own dependencies.
Now I would like to have a DAG that executes a Python script using that Django service's Python environment. (It also uses Django's models; not sure if that's relevant.)
The only way I see it working is with DockerOperator as described here. I managed to set up and execute the test DAG mentioned there; however, when I try to run the real task, it fails due to networking issues. I am quite confident I can solve that issue, but setting everything up this way just seems like way too much hassle.
So, in the end, I guess I am wondering what the ideal architecture should be when using Airflow via compose. Should the base Airflow image be extended with my Django service (creating one hell of a big image), or is there a better way?
You can use PythonVirtualenvOperator (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator), but it will recreate the Django virtualenv every time the task runs, so it is not ideal.
Another option is to run DockerOperator or KubernetesPodOperator (if you use Kubernetes) with a separate image that has Django installed (or even a base Django image).
Adding Django to the Airflow image is probably not the best idea: Airflow has ~500 dependencies when installed with all providers, so chances are you will hit some difficult-to-resolve conflicts.
Also, one of the things we are considering for Airflow 2.2 and beyond is a better way of handling caching, which could help with building a cacheable virtualenv that is created once and shared between workers/pods (but this is only in the discussion phase).
You can check out tomorrow's session on Airflow Summit where we discuss what's coming (and generally Airflow Summit is cool):
https://airflowsummit.org/sessions/2021/looking-ahead-what-comes-after-airflow-2/
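The DockerOperator option mentioned above could be sketched roughly as follows. It assumes the apache-airflow-providers-docker package is installed; the image name, management command, task id and Compose network name are all hypothetical placeholders. Attaching the container to the Compose network is one common way to approach the networking issue raised in the question, not a guaranteed fix.

```python
def django_task_kwargs():
    """Keyword arguments for a DockerOperator that runs a Django script."""
    return {
        "task_id": "run_django_script",  # hypothetical task id
        "image": "my-django-service:latest",  # hypothetical image name
        "command": "python manage.py my_command",  # hypothetical command
        # Hypothetical Compose network, so the container can reach the
        # other services (for example the database) by service name.
        "network_mode": "myproject_default",
    }


def make_django_task(dag):
    """Build the task; the provider import is deferred until it is needed."""
    # Assumes apache-airflow-providers-docker is installed.
    from airflow.providers.docker.operators.docker import DockerOperator

    return DockerOperator(dag=dag, **django_task_kwargs())
```

Keeping Django in its own image means Airflow's dependency set and the Django service's never have to be resolved together.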
There's a similar question here, but it's from 2018, where the solution requires changing the base image for the workers. Another suggestion is to ssh into each node and run apt-get install there. This doesn't seem useful, because when autoscaling spawns new nodes, you'd need to do it again and again.
Anyway, is there a reasonable way to upgrade the base gcloud in late 2020?
Because task instances run in a shared execution environment, it is generally recommended to avoid using the gcloud CLI within Composer Airflow tasks when possible, to prevent state or version conflicts. For example, if multiple users share the same Cloud Composer environment and any of them changes the active credentials used by gcloud, they can unknowingly break the others' workflows.
Instead, consider using the Cloud SDK Python libraries to do what you need to do programmatically, or use the airflow.providers.google.cloud operators, which may already have what you need.
If you really need to use the gcloud CLI and don't share the environment, then you can use a BashOperator with an install/upgrade script as a prerequisite for any tasks that need the CLI. Alternatively, you can build a custom Docker image with gcloud installed, and use GKEPodOperator or KubernetesPodOperator to run the CLI command in a Kubernetes pod. That would be slower, but more reliable than verifying dependencies each time.
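If you do go the prerequisite route, the version check itself can be small. The sketch below assumes that the first line printed by gcloud version has the form "Google Cloud SDK X.Y.Z"; the helper names and the minimum version are hypothetical.

```python
import shutil
import subprocess


def parse_gcloud_version(version_output):
    """Extract (major, minor, patch) from the text printed by `gcloud version`."""
    for line in version_output.splitlines():
        if line.startswith("Google Cloud SDK "):
            return tuple(int(part) for part in line.split()[-1].split("."))
    return None


def installed_gcloud_version():
    """Return the installed gcloud version tuple, or None if gcloud is absent."""
    gcloud = shutil.which("gcloud")
    if gcloud is None:
        return None
    result = subprocess.run(
        [gcloud, "version"], capture_output=True, text=True, check=True
    )
    return parse_gcloud_version(result.stdout)


def needs_upgrade(minimum):
    """True when gcloud is missing or older than the given version tuple."""
    installed = installed_gcloud_version()
    return installed is None or installed < minimum
```

A prerequisite task could call needs_upgrade((319, 0, 0)), with a minimum of your choosing, and run the install/upgrade script only when it returns True.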
Some background: I have set up Airflow on Kubernetes (on AWS). I am able to run DAGs that query a database, send emails or do anything that doesn't require a package that isn't already a part of Airflow. For example, if I try to run a DAG that uses the Facebook-business SDK the DAG will obviously break because the dependency isn't available. I've tried a couple different ways of trying to get this dependency, along with others, installed but haven't been successful.
I have tried to install python packages by modifying my scheduler and webserver deployments to do a pip install of my dependencies as part of an initContainer. When I do this, the DAG remains broken as it is unable to find the needed packages. When I open a shell to my pod I can see that the dependencies have not been installed (I check using pip list). I have also verified that there aren't other python/pip versions installed.
I have also tried installing the dependencies by running pip install from a shell opened in my pod. This does install the dependency in the correct place and makes it available. However, instead of the webserver UI showing that my DAG is broken, I get the "this dag isn't available in the webserver dagbag object" message.
I would expect that running pip install as part of my initContainer or container would make these dependencies available in my pod. However, this isn't the case. It's as if pip install runs without any issues, but by the time my pods are fully set up, the Python packages are nowhere to be found.
I forgot to say that I have found a way to make it work, but it feels somewhat hacky, and it seems there should be a better way:
- If I open a shell to my webserver container and install the needed dependencies and then open a shell to my scheduler and do the same thing, the dependencies are found and the DAG works.
The init container is a separate docker instance. Unless you rig up some sort of shared storage for your python libraries (which is quite dubious) any pip installs in the init container won't impact the running container of the pod.
I see two options:
1) Modify the docker image that you're using to include the packages you need
2) Prepend a pip install to the command being run in the pod. It's not uncommon to string together a few commands with && between them, in order to execute a sequence of operations in a starting pod.
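The second option amounts to wrapping the container's start command in a shell that installs first. A minimal sketch of building such a command is shown below; the function name, the package name and the main command in the usage note are assumptions for illustration.

```python
import shlex


def with_pip_install(packages, main_command):
    """Build a container command that installs packages, then starts the app."""
    install = "pip install " + " ".join(shlex.quote(p) for p in packages)
    # '&&' runs the main command only if the install succeeded.
    script = f"{install} && {main_command}"
    return ["/bin/bash", "-c", script]
```

For example, with_pip_install(["facebook_business"], "airflow scheduler") yields ["/bin/bash", "-c", "pip install facebook_business && airflow scheduler"], which could be used as the container's command in the deployment. The install then runs on every pod start, which is why baking the packages into the image is usually the sturdier choice.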
I would recommend updating your Airflow Docker image to include the libraries you need.
If you plan to use lots of different libraries for specific DAGs, then it may be worth creating multiple Docker images and referencing them at the task level:
MyOperator(...,
    executor_config={
        "KubernetesExecutor": {"image": "myCustomDockerImage"}
    }
)
Reference: baseoperator.py