I'm trying to set up an MWAA Airflow 2.0 environment that integrates S3 and GCP's Pub/Sub. While we have no problems with the environment being initialized, we're having trouble installing some dependencies and importing Python packages -- specifically apache-airflow-providers-google==2.2.0.
We've followed all of the instructions in the official MWAA Python documentation: we included the constraints file as prescribed by AWS, enabled all of the Airflow logging configs, and tested the requirements.txt file using the MWAA local runner. Even so, every time we update our MWAA environment's requirements we end up with the same result: the provider package still fails to import.
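For context, the requirements.txt we're testing is essentially the documented pattern: the AWS-prescribed constraints line followed by the pinned provider. (The constraints URL below is the one for Airflow 2.0.2 on Python 3.7; the exact URL depends on the environment's Airflow and Python versions.)

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.7.txt"
apache-airflow-providers-google==2.2.0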
When testing with the MWAA local runner, we observed that resolving the requirements.txt file with the constraints still takes forever: installation runs anywhere from 10 to 30 minutes or more, which is unworkable.
As an experiment, we tried a version of the requirements.txt file that omits the constraints and the pinned versions. That installs the packages successfully, and we no longer get import errors on either the MWAA local runner or the MWAA environment itself. However, all of our DAGs then fail to run no matter what, and Airflow logs become inaccessible whenever we do this.
The team and I have been trying to get MWAA environments up and running for our different applications and ETL pipelines but we just can't seem to get things to work smoothly. Any help would be appreciated!
I'm having the same problems, and in the end we had to refactor a lot of things to remove the dependency. It looks like a problem between the pip resolver and apache-airflow-providers-google if you look at the official page:
https://pypi.org/project/apache-airflow-providers-google/2.0.0rc1/
In the worst case, you may need to run Airflow directly on EC2 from a Docker image and abandon MWAA :(
I've been through similar issues, but with different packages. There are certain things you need to take into consideration when using MWAA. I didn't have any issues testing the packages on the local runner and then on MWAA using a public VPC; I only had issues with a private VPC, since the web server has no internet connection there and the method for getting packages to MWAA is different.
Things to take into consideration:
The versions of the packages; test on the local runner first if you can.
Enable the logs; the scheduler and web server logs can show you issues, but they also may not. The reason is that Fargate, which serves the images, will try to roll back to a working state rather than leave MWAA in a non-working state. So you might not see what the error actually is; in some scenarios it may even look like there were no errors at all.
Check dependencies; you may need to download a package with pip download <package>==<version> and inspect the contents of the .whl file to see whether it declares dependencies or extra notes that point you in the right direction (a rough example follows this list). In one case, the Slack package wouldn't work until I also added the http package, even though Airflow includes that package.
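For example, one rough way to inspect a provider wheel's declared dependencies (using the package from the question; the download directory is a placeholder):

pip download apache-airflow-providers-google==2.2.0 --no-deps -d ./wheels
unzip -p ./wheels/apache_airflow_providers_google-*.whl '*/METADATA' | grep Requires-Dist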
So yes, it's serverless, and you may have an easy time installing and setting up MWAA, but be prepared to do a little investigation if it doesn't work. I did contact AWS support, but managed to solve it myself in the end. Beyond the obvious things, only people who use MWAA frequently and have faced a variety of scenarios will be of much assistance.
I am trying to wrap my head around how DAGs should actually be executed in a Docker Compose environment when they depend on another service (with a separate Python venv) defined in the compose file.
I have set up Airflow via Docker Compose as described in the official documentation. I have also added a Django service, which has its own dependencies.
Now, I would like to have a DAG that executes a Python script using that Django service's Python environment (the script also uses Django's models; not sure if that's relevant).
The only way I see this working is with DockerOperator, as described here. I managed to set up and execute the test DAG mentioned there, but when I try to run the real task it fails due to networking issues. I am quite confident I can solve that issue, but setting everything up this way seems like way too much hassle.
So, in the end, I guess I am wondering what the ideal architecture should be when using Airflow via Compose. Should the base Airflow image be extended with my Django service (creating one hell of a big image), or is there a better way?
You can use PythonVirtualenvOperator (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator) - but it will recreate the Django virtualenv every time the task runs, so it's not ideal.
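A rough sketch of that shape (the requirements pin and the callable body are illustrative; wiring up Django settings and models inside the callable is left out):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def run_django_job():
    # everything the task needs must be importable inside the freshly built virtualenv
    import django
    print(django.get_version())

with DAG(dag_id="django_venv_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    django_task = PythonVirtualenvOperator(
        task_id="django_task",
        python_callable=run_django_job,
        requirements=["django==3.2"],   # reinstalled into a fresh virtualenv on every run
        system_site_packages=False,
    )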
Another option would be to run DockerOperator or KubernetesPodOperator (if you use Kubernetes) and have a separate image with Django installed (or even a base Django image).
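A minimal sketch of the DockerOperator route, assuming an image built from the Django service and the default Compose network name (the image, command, and network below are placeholders, and apache-airflow-providers-docker must be installed):

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(dag_id="django_docker_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    run_django_script = DockerOperator(
        task_id="run_django_script",
        image="my-django-app:latest",             # image built from your Django service
        command="python manage.py my_command",    # however your script is invoked
        docker_url="unix://var/run/docker.sock",  # the Docker socket must be mounted into the worker
        network_mode="my_project_default",        # the Compose network, so the container can reach the DB
        auto_remove=True,
    )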
Adding Django to Airflow is probably not the best idea - Airflow has ~500 dependencies when installed with all providers, so chances are you will run into some difficult-to-resolve conflicts.
Also, one of the things we are considering for Airflow 2.2 and beyond is a better way of handling caching, which could help with building a cacheable virtualenv that is created once and shared between workers/pods (but this is just at the discussion phase).
You can check out tomorrow's session at the Airflow Summit, where we discuss what's coming (and the Airflow Summit is generally cool):
https://airflowsummit.org/sessions/2021/looking-ahead-what-comes-after-airflow-2/
I'm using the command func azure functionapp publish to publish my Python function app to Azure. As best I can tell, the command only packages up the source code and transfers it to Azure; then, on a remote machine in Azure, the function app is "built" and deployed. The build phase includes the collection of dependencies from PyPI. Is there a way to override where it looks for these dependencies? I'd like to point it to my own PyPI server, or alternatively provide the wheels locally in my source tree and have it use those. I have a few questions/problems:
Are my assumptions correct?
Assuming they are, is this possible, and how?
I've tried a few things, read some docs, and looked at the various --help options in the CLI tool. I set up a pip.conf file that I verified works for local pip usage, then deliberately "broke" it to see if the publish would fail (it did not), which leads me to believe it either ignores pip.conf or the build and the collection of dependencies happen on the remote end. I'm at a loss, and any tips, pointers, or answers are appreciated!
You can add an additional pip source to point to your own PyPI server. Check https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#remote-build-with-extra-index-url
Remote build with extra index URL:
When your packages are available from an accessible custom package index, use a remote build. Before publishing, make sure to create an app setting named PIP_EXTRA_INDEX_URL. The value for this setting is the URL of your custom package index. Using this setting tells the remote build to run pip install using the --extra-index-url option. To learn more, see the Python pip install documentation.
You can also use basic authentication credentials with your extra package index URLs. To learn more, see Basic authentication credentials in Python documentation.
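For example, one way to set that app setting before publishing (the app name, resource group, and index URL are placeholders):

az functionapp config appsettings set --name <APP_NAME> --resource-group <RESOURCE_GROUP> --settings PIP_EXTRA_INDEX_URL=https://my.private.pypi/simple
func azure functionapp publish <APP_NAME>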
And referencing local packages is also possible. Check https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#install-local-packages
I hope both of your questions are answered now.
Some background: I have set up Airflow on Kubernetes (on AWS). I am able to run DAGs that query a database, send emails, or do anything that doesn't require a package that isn't already part of Airflow. For example, if I try to run a DAG that uses the facebook-business SDK, the DAG will obviously break because the dependency isn't available. I've tried a couple of different ways to get this dependency, along with others, installed, but haven't been successful.
I have tried to install python packages by modifying my scheduler and webserver deployments to do a pip install of my dependencies as part of an initContainer. When I do this, the DAG remains broken as it is unable to find the needed packages. When I open a shell to my pod I can see that the dependencies have not been installed (I check using pip list). I have also verified that there aren't other python/pip versions installed.
I have also tried to install the dependencies by running a pip install when I open a shell to my pod. This is successful in installing the dependency in the correct place and also makes it available. However, instead of the web server UI showing that my DAG is broken, I get the "This DAG isn't available in the webserver DagBag object" message.
I would expect that running pip install as part of my initContainer or container would make these dependencies available in my pod. However, this isn't the case. It's as if pip install runs without any issues, but by the time my pods are fully set up, the Python packages are nowhere to be found.
I forgot to say that I have found a way to make it work, but it feels somewhat hacky, and it seems like there should be a better way:
- If I open a shell to my webserver container and install the needed dependencies and then open a shell to my scheduler and do the same thing, the dependencies are found and the DAG works.
The init container is a separate container instance. Unless you rig up some sort of shared storage for your Python libraries (which is quite dubious), any pip installs in the init container won't affect the running containers of the pod.
I see two options:
1) Modify the Docker image that you're using to include the packages you need (see the sketch after this list)
2) Prepend a pip install to the command being run in the pod. It's not uncommon to string together a few commands with && between them, in order to execute a sequence of operations in a starting pod.
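For option 1, a minimal sketch of a derived image (the base tag and package are illustrative; build from whatever Airflow image your scheduler, web server, and workers currently run):

FROM apache/airflow:2.1.0
# add the DAG-level dependencies on top of the existing Airflow image
RUN pip install --no-cache-dir facebook-business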
I would recommend updating your Airflow Docker image to include the libraries you need.
If you plan to use lots of different libraries for specific DAGs, then it may be worth creating multiple Docker images and then referencing them at a task level.
MyOperator(...,
executor_config={
"KubernetesExecutor":
{"image": "myCustomDockerImage"}
}
)
Reference: baseoperator.py
My basic question: How would I set an environment variable that will be in effect during the Elastic Beanstalk deploy process?
I am not talking about setting environment variables during deployment that will be accessible by my application after it is deployed, I want to set environment variables that will modify a specific behavior of Elastic Beanstalk's build scripts.
To be clear - I generally think this is a bad idea, but it might be OK in this case so I am trying this out as an experiment. Here is some background about why I am looking into this, and why I think it might be OK:
I am in the process of transferring a server from AWS in the US to AWS in China, and am finding that server deploys fail between 50% ~ 100% of the time, depending on the day. This is a major pain during development, but I am primarily concerned about how I am going to make this work in production.
This is an Amazon Linux server running Python 2.7, and logs indicate that the failures are mainly Read Timeout Errors, with a few Connection Reset by Peer errors thrown in once in a while, all generated by pip install while attempting to download packages from PyPI. To verify this, I have ssh'd into my instances to manually install a few packages, and on a small sample size I see similar failure rates. Note that this is pretty common when trying to access content on the other side of China's GFW.
So, I wrote a script that pip downloads the packages to my local machine and then aws syncs them to an S3 bucket located in the same region as my server. This eliminates the need to cross the GFW while deploying.
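(Roughly, the local side of that script does something like the following; the bucket name is a placeholder.)

pip download -r requirements.txt -d ./packages
aws s3 sync ./packages s3://my-packages-bucket/packages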
My original plan was to add an .ebextension that aws cps the packages from S3 into the pip cache, but (unless I missed something) this somewhat surprisingly doesn't appear to be straightforward.
So, as plan B, I am redirecting the packages into a local directory on the instance. This is working well, but I can't get pip install to pull packages from the local directory rather than downloading them from PyPI.
Following the pip documentation, I expected that pointing the PIP_FIND_LINKS environment variable at my package directory would have pip "naturally" pull packages from my directory rather than PyPI. That would make the change transparent to the EB build scripts, which is why I thought this might be a reasonable solution.
So far I have tried:
1) a command which exports PIP_FIND_LINKS=/path/to/package, with no luck. I assumed that this was due to the deploy step being called from a different session, so I then tried:
2) a command which (in addition to the previous export) appends export PIP_FIND_LINKS=/path/to/package to ~/.profile, in an attempt to have it apply to any new sessions.
I have tried issuing the commands as both ec2-user and root, and neither works.
Rather than keep poking a stick at this, I was hoping that someone with a bit more experience with the nuances of EB, pip, etc might be able to provide some guidance.
After some thought I decided that a pip config file should be a more reliable solution than environment variables.
This turned out to be easy to implement with .ebextensions. I first create the download script, then create the config file directly in the virtualenv folder:
files:
  /home/ec2-user/download_packages.sh:
    mode: "000500"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      package_dir=/path/to/packages
      mkdir -p $package_dir
      aws s3 sync s3://bucket/packages $package_dir
  /opt/python/run/venv/pip.conf:
    mode: "000755"
    owner: root
    group: root
    content: |
      [install]
      find-links = file:///path/to/packages
      no-index = false
Finally, a command is used to call the script that we just created:
commands:
  03_download_packages:
    command: bash /home/ec2-user/download_packages.sh
One potential issue is that pip bypasses the local package directory for the packages that come from our private git repo, so there is still some potential for timeout errors, but these represent just a small fraction of the packages that need to be installed, so it should be workable.
Still unsure if this will be a long-term solution, but it is very simple and (after just one day of testing...) failure rates have fallen from 50% ~ 100% to 0%.
I have a directory of Python programs, classes, and packages that I currently distribute to 5 servers. It seems I'm continually going to be adding more servers, and right now I'm just doing a basic rsync from my local box over to the servers.
What would a better approach be for distributing code across n servers?
thanks
I use Mercurial with Fabric to deploy all the source code. Fabric is written in Python, so it'll be easy for you to get started. Updating the production service is as simple as fab production deploy, which ends up doing something like this:
Shut down all the services and put an "Upgrade in Progress" page.
Update the source code directory.
Run all migrations.
Start up all services.
It's pretty awesome seeing this all happen automatically.
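A rough fabfile sketch of that flow (host names, paths, and service commands are illustrative; this uses the Fabric 1.x API, where fab production deploy composes the two tasks):

# fabfile.py
from fabric.api import cd, env, run, task

@task
def production():
    env.hosts = ['app1.example.com', 'app2.example.com']

@task
def deploy():
    run('sudo service myapp stop')       # shut down services and show the "Upgrade in Progress" page
    with cd('/srv/myapp'):
        run('hg pull && hg update')      # update the source code directory from Mercurial
        run('python manage.py migrate')  # run all migrations (use whatever your migration command is)
    run('sudo service myapp start')      # start all services back up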
First, make sure to keep all code under revision control (if you're not already doing that), so that you can check out new versions of the code from a repository instead of having to copy it to the servers from your workstation.
With revision control in place you can use a tool such as Capistrano to automatically check out the code on each server without having to log in to each machine and do a manual checkout.
With such a setup, deploying a new version to all servers can be as simple as running
$ cap deploy
from your local machine.
While I also use version control to do this, another approach you might consider is to package up the source using whatever package management your host systems use (for example RPMs or dpkgs), and set up the systems to use a custom repository. Then an "apt-get upgrade" or "yum update" will update the software on the systems, and you could use something like "mussh" to run the stop/update/start commands across all the machines.
Ideally, you'd push it to a "testing" repository first, have your staging systems install it, and once the testing of that was signed off on you could move it to the production repository.
It's very similar to the recommendations of using fabric or version control in general, just another alternative which may suit some people better.
The downside to using packages is that you're probably using version control anyway, and you do have to manage version numbers of these packages. I do this using revision tags within my version control, so I could just as easily do an "svn update" or similar on the destination systems.
In either case, you may need to consider the migration from one version to the next. If a user loads a page that contains references to other elements, and you do the update and those elements go away, what do you do? You may wish to handle this either within your deployment scripting or within your code: first push out a version with the new page that keeps the old referenced elements, deploy that, and then remove the referenced elements and deploy again later.
In this way users won't see broken elements within the page.