I am trying to wrap my head around how DAGs should actually be executed in a Docker Compose environment when they depend on another service (with its own Python venv) defined in the compose file.
I have set up Airflow via Docker Compose as described in the official documentation. I have also added a Django service, which has its own dependencies.
Now I would like to have a DAG that executes a Python script using that Django service's Python environment (it also uses Django's models; not sure if that's relevant).
The only way I see it working is with the DockerOperator as described here. I managed to set up and execute the test DAG mentioned there, but when I try to run the real task, it fails due to networking issues. I am quite confident I can solve that issue, but setting everything up this way just seems like way too much hassle.
So, in the end, I guess I am wondering what the ideal architecture should be when using Airflow via compose. Should the base Airflow image be extended with my Django service (creating one hell of a big image), or is there a better way?
You can use the PythonVirtualenvOperator (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator) - but it will recreate the Django virtualenv every time the task runs, so it is not ideal.
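For example, a minimal sketch of what that could look like (the DAG id, callable and pinned Django version are assumptions, not taken from your setup):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def run_django_job():
    # Imports must live inside the callable, because it executes in a freshly built virtualenv.
    import django  # noqa: F401
    # ... configure Django settings here and call your script


with DAG(
    dag_id="django_in_virtualenv",     # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonVirtualenvOperator(
        task_id="django_task",
        python_callable=run_django_job,
        requirements=["django==3.2"],  # assumed pin - use whatever your Django service uses
        system_site_packages=False,
    )

The downside mentioned above is visible here: the requirements are resolved and installed from scratch on every task run.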
Another option would be to run the DockerOperator or the KubernetesPodOperator (if you use Kubernetes) and have a separate image with Django installed (or even the base Django image).
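If you did go the Kubernetes route, a rough sketch of the pod-based variant could look like this (the image name, namespace and command are made up for illustration):

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

run_django_job = KubernetesPodOperator(
    dag=dag,                                     # attach to your existing DAG
    task_id="django_job",
    name="django-job",
    namespace="default",                         # assumed namespace
    image="registry.example.com/my-django:1.0",  # your separate Django image
    cmds=["python", "manage.py"],
    arguments=["my_management_command"],         # hypothetical management command
    get_logs=True,
    is_delete_operator_pod=True,
)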
Adding Django to Airflow is probably not the best idea - Airflow has ~500 dependencies when installed with all providers, so chances are you will run into some difficult-to-resolve conflicts.
Also, one of the things we are considering for Airflow 2.2 and beyond is a better way of handling caching, which could help with building a cacheable virtualenv created once and shared between workers/pods (but this is just in the discussion phase).
You can check out tomorrow's session at the Airflow Summit where we discuss what's coming (and generally the Airflow Summit is cool):
https://airflowsummit.org/sessions/2021/looking-ahead-what-comes-after-airflow-2/
I'm trying to set up an MWAA Airflow 2.0 environment that integrates S3 and GCP's Pub/Sub. While we have no problems with the environment being initialized, we're having trouble installing some dependencies and importing Python packages -- specifically apache-airflow-providers-google==2.2.0.
We've followed all of the instructions based on the official MWAA Python documentation. We already included the constraints file as prescribed by AWS, activated all Airflow logging configs, and tested the requirements.txt file using the MWAA local runner. The result when updating our MWAA environment's requirements would always be like this
When testing with the MWAA local runner, we observed that using the requirements.txt file with the constraints still takes forever to resolve. Installation takes 10-30 minutes, which is no good.
As an experiment, we tried a version of the requirements.txt file that omits the constraints and pinned versioning. Doing so installs the packages successfully, and we no longer get import errors on either the MWAA local runner or our MWAA environment itself. However, all of our DAGs fail to run no matter what. Airflow logs are also inaccessible whenever we do this.
The team and I have been trying to get MWAA environments up and running for our different applications and ETL pipelines but we just can't seem to get things to work smoothly. Any help would be appreciated!
I ran into the same problems, and in the end we had to refactor a lot of things to remove the dependency. It looks like it is a problem with the pip resolver and apache-airflow-providers-google, if you look at the official page:
https://pypi.org/project/apache-airflow-providers-google/2.0.0rc1/
In the worst case, you may need to run Airflow directly on EC2 from a Docker image and abandon MWAA :(
I've been through similar issues, but with different packages. There are certain things you need to take into consideration when using MWAA. I didn't have any issues testing the packages on the local runner and then on MWAA using a public VPC; I only had issues with a private VPC, as the web server doesn't have an internet connection, so the method of getting the packages to MWAA is different.
Things to take into consideration:
The version of the packages; test on the local runner if you can first
Enable the logs; the scheduler and web server logs can show you issues, but they also may not. The reason is that Fargate, which serves the images, will try to roll back to a working state rather than leave MWAA in a non-working state. So you might not see what the error actually is; in certain scenarios it may even look like there were no errors at all.
Check dependencies; you may need to download a package with pip download <package>==version. There you can inspect the contents of the .whl file and see if there are any dependencies. It may contain extra notes that point you in the right direction. In one case, the Slack package wouldn't work until I also added the http package, even though Airflow includes that package.
So yes, it's serverless, and you may have an easy time installing and setting up MWAA, but be prepared to do a little investigation if it doesn't work. I did contact AWS support, but managed to solve it myself in the end. Other than trying the obvious things, only those who use MWAA frequently and have faced varying scenarios will be of much assistance.
I have a Flask server that consists of multiple methods. I am aiming to automate the execution of these methods using Airflow.
I am thinking of using the following steps:
Setting up Airflow by defining multiple DAGs to call the relevant Flask methods in a pipeline.
Deploying Flask Server.
Deploying Airflow (using docker-compose).
Mainly, I am thinking of keeping the Airflow and Flask servers separate and independent. Do you think this is a good plan? Any other suggestions would be highly appreciated.
It depends on a couple of things.
Can you run the methods from inside Airflow? For security reasons it is often required to keep some functionality in a different environment/cluster. A reason for that could be the database access you would have to give to the Airflow environment.
Are these methods also invoked from other locations, or are they used solely by Airflow?
What other functionality does the Flask server have that you can't live without?
Are there Python dependency conflicts? Even in that case you could use Airflow's PythonVirtualenvOperator.
If nothing here completely blocks you from invoking these methods from inside Airflow, I would vote for doing it entirely inside Airflow. This will reduce coupling and also reduce the maintenance burden for you in the long term. Besides, Airflow keeps you from having to worry about a lot of things, like connectivity, exception codes and callbacks for when something goes wrong.
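For illustration, a minimal sketch of what "entirely inside Airflow" could look like, assuming the business logic can be imported as plain Python functions rather than called over HTTP (the module, function and DAG names are made up):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical import: the same function your Flask endpoint currently calls.
from my_project.jobs import run_daily_report


with DAG(
    dag_id="daily_report",           # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_daily_report",
        python_callable=run_daily_report,
    )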
This question is related to understanding a concept regarding the DockerOperator and Apache Airflow, so I am not sure if this site is the correct place. If not, please let me know where I can post it.
The situation is the following: I am working on a Windows laptop, and I have developed a very basic ETL pipeline that extracts data from some server and writes the unprocessed data into a MongoDB on a scheduled basis with Apache Airflow. I have a docker-compose.yml file with four services: a mongo service for the MongoDB, a mongo-express service as an admin tool for the MongoDB, a webserver service for Apache Airflow, and a postgres service as the database backend for Apache Airflow.
So far, I have developed some Python code in functions and these functions are being called by the Airflow instance using the PythonOperator. Since debugging is very difficult using the PythonOperator, I want to try the DockerOperator now instead. I have been following this tutorial which claims that using the DockerOperator, you can develop your source code independent of the operating system the code will later be executed on due to Docker's concept 'build once, run everywhere'.
My problem is that I don't fully understand all the steps needed to run code using the DockerOperator. Regarding the tutorial, I have the following questions about task development and deployment:
Package the artifacts together with all dependencies into a Docker image. ==> Does this mean that I have to create a Dockerfile for every task and then build an image using this Dockerfile?
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
Thanks for your time, I highly appreciate it!
Typically you're going to have a Docker image that handles one type of task. So for any one pipeline you'd probably be using a variety of Docker images, a different one for each step.
There are a couple of considerations here in regards to your question which is specifically around deployment.
You'll need to create a Docker image. You likely want to add a tag to this as you will want to version the image. The DockerOperator defaults to the latest tag on an image.
The image needs to be available to your deployed instance of Airflow. Images can be built on the machine you're running Airflow on if you want to run it locally. If you've deployed Airflow somewhere online, the more common practice is to push them to a container registry. There are a number of providers you can use (Docker Hub, Amazon ECR, etc.).
Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?
If your image is built and available to Airflow, you simply need to create a task using the DockerOperator, like so:
from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator  # Airflow 2.x provider import

dag = DAG(**kwargs)  # your usual DAG arguments

task_1 = DockerOperator(
    dag=dag,
    task_id='docker_task',
    image='dummyorg/dummy_api_tools:v1',
    auto_remove=True,
    docker_url='unix://var/run/docker.sock',
    command='python extract_from_api_or_something.py'
)
I'd recommend investing some time into understanding Docker. It's a little bit difficult to wrap your head around at first but it's a highly valuable tool, especially for systems like Airflow.
I've got a little question about dependency management for packages used in Python operators.
We are using Airflow in an industrialized way to run scheduled Python jobs. It works well, but we are facing issues dealing with the different Python libraries needed by each DAG.
Do you have any idea how to let developers install their own dependencies for their jobs without being admin, while making sure these dependencies don't collide with those of other jobs?
Would you recommend having a bash task that loads a virtualenv at the beginning of the job? Is there any official recommendation on how to do it?
Thanks !
Romain.
In general I see two possible solutions for your problem:
Airflow has a PythonVirtualenvOperator, which allows a task to run in a virtualenv that gets created and destroyed automatically. You can pass a python_version and a list of requirements to the task to build the virtualenv.
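On recent Airflow 2 releases the same operator is also available as a TaskFlow decorator, which lets each developer declare dependencies right next to their task; a rough sketch (the package pin and function body are placeholders):

from airflow.decorators import task


@task.virtualenv(
    requirements=["pandas==1.3.3"],  # hypothetical per-job dependency
    system_site_packages=False,
)
def transform_data():
    # Imports live inside the function because it runs in its own throwaway virtualenv.
    import pandas as pd
    return int(pd.Series([1, 2, 3]).sum())

# transform_data() would then be called inside a @dag-decorated function or a DAG context.

The trade-off is the same either way: the virtualenv is rebuilt on every run.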
Set up a Docker registry and use a DockerOperator rather than a PythonOperator. This would allow teams to set up their own Docker images with specific requirements. This is how I think Heineken set up their Airflow jobs, as presented at their Airflow Meetup. I've been trying to see whether they posted their slides online, but I can't seem to find them.
A similar question was asked here; however, the solution does not give the shell access to the same environment as the deployment. If I inspect os.environ from within the shell, none of the environment variables appear.
Is there a way to run the manage.py shell with the environment?
PS: As a little side question, I know the mantra for EBS is to stop using eb ssh, but then how would you run one-off management scripts (that you don't want to run on every deploy)?
One of the cases where you have to run something once is DB schema migrations. Usually you store information about that in the DB... So you can use the DB to sync / ensure that something was triggered only once.
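As a rough illustration of that "use the DB as the marker" idea (the model and command names are hypothetical, not from your project), a one-off Django management command could record its own run in a marker table:

# Hypothetical file: my_app/management/commands/run_once_job.py
from django.core.management.base import BaseCommand
from django.db import transaction

from my_app.models import OneOffRun  # assumed model with a unique CharField "name"


class Command(BaseCommand):
    help = "Runs a one-off script at most once, using the database as the marker."

    def handle(self, *args, **options):
        with transaction.atomic():
            marker, created = OneOffRun.objects.get_or_create(name="2021-09-backfill")
            if not created:
                self.stdout.write("Already ran, skipping.")
                return
            # ... the actual one-off work goes here ...
            self.stdout.write("Done.")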
Personally I have nothing against using eb ssh; I do see problems with it, however. If you want to have CI/CD, that manual operation is against the rules.
It looks like you are referring to the WWW/API part of Beanstalk. If you need something that runs quite frequently... maybe a worker is more suitable? The problem here is that if the API gets deployed first, you would have the wrong schema.
In general you are using EC2, so it's the user data that stores the information that spins up your service. So there you can put your "stuff". You still need to sync / ensure it runs only once. Here are the docs for Beanstalk, with more information on how to do that.
Edit
Beanstalk is a kind of instrumentation on top of EC2, so there must be a way to work with it, since you have access to the user data of those EC2 instances. No worries, you don't need to dig that deep. There is a good way of instrumenting your server: it is called ebextensions. It can be used to put files on the server, trigger commands, set up cron - whatever you want.
You can create an ebextension with container_commands (see the Python Configuration Namespaces section this time). Those commands are executed on each deployment. Still, the problem is that you need to sync, since more than one deployment can run at the same time. The good part is that you can set the environment the way you want.
I have no problem accessing the environment variables. How did you run into the problem? Try preparing a page with the map.