My basic question: How would I set an environment variable that will be in effect during the Elastic Beanstalk deploy process?
I am not talking about setting environment variables during deployment that will be accessible by my application after it is deployed, I want to set environment variables that will modify a specific behavior of Elastic Beanstalk's build scripts.
To be clear - I generally think this is a bad idea, but it might be OK in this case so I am trying this out as an experiment. Here is some background about why I am looking into this, and why I think it might be OK:
I am in the process of transferring a server from AWS in the US to AWS in China, and am finding that server deploys fail between 50% ~ 100% of the time, depending on the day. This is a major pain during development, but I am primarily concerned about how I am going to make this work in production.
This is an Amazon Linux server running Python 2.7, and logs indicate that the failures are mainly Read Timeout Errors, with a few Connection Reset by Peers thrown in once in a while, all generated by pip install while attempting to download packages from pypi. To verify this I have ssh'd into my instances to manually install a few packages, and on a small sample size see similar failure rates. Note that this is pretty common when trying to access content on the other side of China's GFW.
So, I wrote a script that pip downloads the packages to my local machine, then aws syncs them to an S3 bucket located in the same region as my server. This would eliminate the need to cross the GFW while deploying.
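The script itself is not shown above, so here is only a rough sketch of what that two-step flow could look like in Python. The file name `requirements.txt`, the `./packages` directory and the `s3://bucket/packages` URI are placeholders, and the function names are made up for illustration.

```python
import subprocess


def build_mirror_commands(requirements, package_dir, s3_uri):
    """Return the two commands: download packages locally, then sync them to S3."""
    # Paths and bucket are placeholders supplied by the caller.
    download = ["pip", "download", "-r", requirements, "-d", package_dir]
    sync = ["aws", "s3", "sync", package_dir, s3_uri]
    return [download, sync]


def mirror_packages(requirements, package_dir, s3_uri):
    """Run both steps, stopping at the first failure."""
    for command in build_mirror_commands(requirements, package_dir, s3_uri):
        subprocess.run(command, check=True)
```

Calling `mirror_packages("requirements.txt", "./packages", "s3://bucket/packages")` would run both steps. One caveat: `pip download` picks files for the machine it runs on, so packages with compiled extensions may need to be downloaded on (or for) the same platform as the server.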
My original plan was to add an .ebextension that aws cps the packages from S3 to the pip cache, but (unless I missed something) this somewhat surprisingly doesn't appear to be straightforward.
So, as plan B I am redirecting the packages into a local directory on the instance. This is working well, but I can't get pip install to pull packages from the local directory rather than downloading the packages from pypi.
Following the pip documentation, I expected that pointing the PIP_FIND_LINKS environment variable to my package directory would have pip "naturally" pull packages from my directory rather than pypi. That would make the change transparent to the EB build scripts, which is why I thought this might be a reasonable solution.
So far I have tried:
1) a command which exports PIP_FIND_LINKS=/path/to/package, with no luck. I assumed that this was due to the deploy step being called from a different session, so I then tried:
2) a command which (in addition to the previous export) appends export PIP_FIND_LINKS=/path/to/package to ~/.profile, in an attempt to have this apply to any new sessions.
I have tried issuing the commands as both ec2-user and root, and neither works.
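A likely reason for attempt 1 failing (this is an assumption, since the internals of the EB deploy scripts are not shown here) is that an exported variable only lives in the process that set it and in that process's children. A small Python sketch of the effect, using two sibling child processes as stand-ins for "my command" and "the deploy step":

```python
import os
import subprocess
import sys

# Illustration only: these snippets stand in for the EB command and the
# later pip step; they are not what Elastic Beanstalk actually runs.
SET_VAR = "import os; os.environ['PIP_FIND_LINKS'] = '/path/to/package'"
READ_VAR = "import os; print(os.environ.get('PIP_FIND_LINKS', 'unset'))"


def run_python(code, env):
    """Run a snippet in a fresh Python process and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout.strip()


def variable_seen_by_sibling():
    """Set the variable in one child process, then read it from another."""
    clean_env = {k: v for k, v in os.environ.items() if k != "PIP_FIND_LINKS"}
    run_python(SET_VAR, clean_env)          # first process sets it and exits
    return run_python(READ_VAR, clean_env)  # second process starts fresh
```

`variable_seen_by_sibling()` reports the variable as unset in the second process. As for attempt 2, ~/.profile is normally only read by login shells, so a script launched by a daemon would not pick it up either.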
Rather than keep poking a stick at this, I was hoping that someone with a bit more experience with the nuances of EB, pip, etc might be able to provide some guidance.
After some thought I decided that a pip config file should be a more reliable solution than environment variables.
This turned out to be easy to implement with .ebextensions. I first create the download script, then create the config file directly in the virtualenv folder:
files:
  /home/ec2-user/download_packages.sh:
    mode: "000500"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      package_dir=/path/to/packages
      mkdir -p $package_dir
      aws s3 sync s3://bucket/packages $package_dir
  /opt/python/run/venv/pip.conf:
    mode: "000755"
    owner: root
    group: root
    content: |
      [install]
      find-links = file:///path/to/packages
      no-index=false
Finally, a command is used to call the script that we just created:
commands:
  03_download_packages:
    command: bash /home/ec2-user/download_packages.sh
One potential issue is that pip bypasses the local package directory and downloads packages that are stored in our private git repo, so there is still potential for timeout errors, but these represent just a small fraction of the packages that need to be installed so it should be workable.
Still unsure if this will be a long-term solution, but it is very simple and (after just one day of testing...) failure rates have fallen from 50% ~ 100% to 0%.
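To see ahead of time which requirements would still be fetched remotely (such as the git-hosted ones mentioned above), a check along these lines might help. It is a sketch only: it compares project names and ignores versions, and the function names are made up for illustration.

```python
import re


def normalize(name):
    """Normalize a project name (case and runs of -, _, . are equivalent)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_from_filename(filename):
    """Extract the project name from a wheel or sdist filename."""
    if filename.endswith(".whl"):
        # In wheel filenames the project part never contains "-".
        return normalize(filename.split("-")[0])
    for ext in (".tar.gz", ".tar.bz2", ".zip"):
        if filename.endswith(ext):
            # sdists look like <name>-<version><ext>.
            return normalize(filename[: -len(ext)].rsplit("-", 1)[0])
    return None


def requirements_not_mirrored(requirement_lines, filenames):
    """Return requirement lines with no matching file in the local directory."""
    available = {project_from_filename(f) for f in filenames}
    missing = []
    for line in requirement_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        is_vcs = line.startswith(("git+", "-e "))  # VCS installs always go remote
        if is_vcs or match is None or normalize(match.group(0)) not in available:
            missing.append(line)
    return missing
```

Feeding it the lines of requirements.txt and `os.listdir("/path/to/packages")` gives the list of entries that pip would still need the network for.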
I'm trying to set up an MWAA Airflow 2.0 environment that integrates S3 and GCP's Pub/Sub. While we have no problems with the environment being initialized, we're having trouble installing some dependencies and importing Python packages -- specifically apache-airflow-providers-google==2.2.0.
We've followed all of the instructions based on the official MWAA Python documentation. We already included the constraints file as prescribed by AWS, activated all Airflow logging configs, and tested the requirements.txt file using the MWAA local runner. The result when updating our MWAA environment's requirements would always be like this
When testing using the MWAA local runner, we observed that using the requirements.txt file with the constraints still takes forever to resolve. Installation takes 10-30 minutes or more, which is no good.
As an experiment, we tried using a version of the requirements.txt file that omits the constraints and pinned versioning. Doing so installs the packages successfully, and we no longer receive import errors on either the MWAA local runner or our MWAA environment itself. However, all of our DAGs will fail to run no matter what. Airflow logs are also inaccessible whenever we do this.
The team and I have been trying to get MWAA environments up and running for our different applications and ETL pipelines but we just can't seem to get things to work smoothly. Any help would be appreciated!
I'm having the same problems, and in the end we had to refactor a lot of things to remove the dependency. It looks like a problem with the pip resolver and apache-airflow-providers-google, if you look at the official page:
https://pypi.org/project/apache-airflow-providers-google/2.0.0rc1/
In the WORST case, you may need to run Airflow directly on EC2 from a Docker image and abandon MWAA :(
I've been through similar issues but with different packages. There are certain things you need to take into consideration when using MWAA. I didn't have any issue testing the packages on the local runner then on MWAA using a public VPC, I only had issues when using a private VPC as the web server doesn't have an internet connection, so the method to get the packages to MWAA is different.
Things to take into consideration:
The version of the packages; test on the local runner if you can first
Enable the logs; the scheduler and web server logs can show you issues, but they also may not. The reason is that Fargate, which serves the images, will try to roll back to a working state rather than leave MWAA in a non-working state. So you might not see what the error actually is, and in certain scenarios it may even look like there were no errors.
Check dependencies; You may need to download a package with pip download <package>==version. There you can inspect the contents of the .whl file and see if there are any dependencies. You may have extra notes that can point you in the right direction. In one case, using the Slack package wouldn't work until I also added the http package, even though Airflow includes this package.
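For the dependency check, the inspection can also be scripted instead of unzipping by hand. A minimal sketch, assuming the wheel was already fetched with pip download (the function name is made up for illustration):

```python
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # Every wheel carries a <name>-<version>.dist-info/METADATA file.
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        metadata_text = wheel.read(metadata_name).decode("utf-8")
    # METADATA uses RFC 822 style headers, so the email parser can read it.
    headers = Parser().parsestr(metadata_text)
    return headers.get_all("Requires-Dist") or []
```

Running this on a downloaded .whl lists the dependencies the package declares, which can then be compared against what is in requirements.txt.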
So, yes it's serverless, and you may have an easy time installing/setting MWAA up, but be prepared to do a little investigation if it doesn't work. I did contact AWS support, but managed to solve it myself in the end. Other than trying the obvious things, only those that use MWAA frequently and have faced varying scenarios will be of any assistance.
I'm using the command func azure functionapp publish to publish my python function app to Azure. As best I can tell, the command only packages up the source code and transfers it to Azure, then on a remote machine in Azure, the function app is "built" and deployed. The build phase includes the collection of dependencies from pypi. Is there a way to override where it looks for these dependencies? I'd like to point it to my own pypi server, or alternatively, provide the wheels locally in my source tree and have it use those. I have a few questions/problems:
Are my assumptions correct?
Assuming they are, is this possible, and how?
I've tried a few things, read some docs, and looked at the various --help options in the CLI tool. I've set up a pip.conf file that I've verified works for local pip usage, then on purpose "broken it" and tried to see if the publish would fail (it did not, so this leads me to believe that either it ignores pip.conf, or the build and collection of dependencies happen on the remote end). I'm at a loss and any tips, pointers, or answers are appreciated!
You can add an additional pip source to point to your own pypi server. Check https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#remote-build-with-extra-index-url
Remote build with extra index URL:
When your packages are available from an accessible custom package index, use a remote build. Before publishing, make sure to create an app setting named PIP_EXTRA_INDEX_URL. The value for this setting is the URL of your custom package index. Using this setting tells the remote build to run pip install using the --extra-index-url option. To learn more, see the Python pip install documentation.
You can also use basic authentication credentials with your extra package index URLs. To learn more, see Basic authentication credentials in Python documentation.
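If the credentials contain characters such as @ or :, they have to be percent-encoded before being placed in the URL. A small sketch of building such a value in Python; the host name, user and password are made-up placeholders, and the helper name is hypothetical:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def index_url_with_credentials(index_url, username, password):
    """Embed percent-encoded basic-auth credentials in a package index URL."""
    parts = urlsplit(index_url)
    # safe="" makes quote() encode "@", ":" and "/" as well.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

The returned string is what would go into the PIP_EXTRA_INDEX_URL app setting. Keep in mind that anyone who can read the app settings can read the password.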
As for referring to local packages, that is also possible. Check https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#install-local-packages
I hope both of your questions are answered now.
What I have
I have 2 python apps that share a few bits of code, enough that I am trying to isolate the shared parts into modules/packages/libraries (I'm keeping the term vague on purpose, as I am not sure what the solution is). All my code is in a monorepo, because I am hoping to overcome some of the annoyances of managing more repos than we have team members.
Currently my file layout looks like:
+ myproject
+ appA
| + python backend A
| + js frontend
+ appB
| + B stuff
+ libs
+ lib1
+ lib2
Both appA and appB use lib1 and lib2 (they are essentially data models to abstract away the shared database). appA is a webapp with several components, not all of which are python. It is deployed as a docker stack that involves a bunch of containers.
I manage my dependencies with poetry to ensure reproducible builds, etc... Each python component (appA, appB...) has its own pyproject.toml file, virtual env, etc...
appB is deployed separately.
All development is on linux, if it makes any difference.
What I need
I am looking for a clean way to deal with the libs:
development for appA is done in a local docker-compose setup. The backend auto-reloads on file changes (using a docker volume), and I would like it to happen for changes in the libs too.
development for appB is simpler, but is moving to docker so the problem will be the same.
What I've tried
My initial "solution" was to copy the libs folder over to a temporary location for development in appA. It works for imports, but it's messy as soon as I want to change the libs code (which is still quite often), as I need to change the original file, copy over, rebuild the container.
I tried symlinking the libs into the backend's docker environment, but symlinks don't seem to work well with docker (it did not seem to follow the link, so the files don't end up in the docker image, unless I essentially copy the files inside the docker build context, which defeats the purpose of the link.)
I have tried packaging each lib into a python package and installing them via poetry add ../../libs/lib1, which doesn't really work inside docker because the paths don't match, and then I'm back to the symlink issue.
I am sure there is a clean way to do this, but I can't figure it out. I know I could break up the repo into smaller ones and install dependencies, but development would still cause problems inside docker, as I would still need to rebuild the container each time I change the lib files, so I would rather keep the monorepo.
If you are using docker-compose anyway you could use volumes to mount the local libs in your container and be able to edit them in your host system and the container. Not super fancy, but that should work, right?
@ckaserer your suggestion seems to work, indeed. In short, in the docker files I do COPY ../libs/lib1 /app/lib1 and then for local development, I mount ../libs/lib1 onto /app/lib1. That gives me the behavior I was looking for. I use a split docker-compose file for this. The setup causes a few issues with various tools needing some extra config so they know that the libs are part of the code base, but nothing impossible. Thanks for the idea!
So, even though it's not an ideal solution, locally mounting over the app and lib directories works on Linux systems.
FYI: On Windows hosts you might run into trouble if you want to watch for file changes, as those are not propagated from a Windows host to a Linux container.
Some background: I have set up Airflow on Kubernetes (on AWS). I am able to run DAGs that query a database, send emails or do anything that doesn't require a package that isn't already a part of Airflow. For example, if I try to run a DAG that uses the Facebook-business SDK the DAG will obviously break because the dependency isn't available. I've tried a couple different ways of trying to get this dependency, along with others, installed but haven't been successful.
I have tried to install python packages by modifying my scheduler and webserver deployments to do a pip install of my dependencies as part of an initContainer. When I do this, the DAG remains broken as it is unable to find the needed packages. When I open a shell to my pod I can see that the dependencies have not been installed (I check using pip list). I have also verified that there aren't other python/pip versions installed.
I have also tried to install the dependencies by running a pip install when I open a shell to my pod. This way is successful in installing the dependency in the correct place and also makes it available. However, instead of the webserver UI showing that my DAG is broken, I get the "this dag isn't available in the webserver dagbag object" message.
I would expect that running pip install as part of my initContainer or container would make these dependencies available in my pod. However, this isn't the case. It's as if pip install runs without any issues, but by the time my pods are fully set up the python packages are nowhere to be found.
I forgot to say that I have found a way to make it work, but it feels somewhat hacky and like there should be a better way
- If I open a shell to my webserver container and install the needed dependencies and then open a shell to my scheduler and do the same thing, the dependencies are found and the DAG works.
The init container is a separate docker instance. Unless you rig up some sort of shared storage for your python libraries (which is quite dubious) any pip installs in the init container won't impact the running container of the pod.
I see two options:
1) Modify the docker image that you're using to include the packages you need
2) Prepend a pip install to the command being run in the pod. It's not uncommon to string together a few commands with && between them, in order to execute a sequence of operations in a starting pod.
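As an illustration of option 2, this is roughly how such a chained command could be put together. It is a sketch: the helper name is made up, and where the resulting list goes (the container's command in your deployment spec) depends on your setup.

```python
import shlex


def command_with_pip_install(packages, main_command):
    """Build a container command that installs packages, then runs the main process."""
    install = "pip install " + " ".join(shlex.quote(p) for p in packages)
    main = " ".join(shlex.quote(arg) for arg in main_command)
    # "&&" runs the main command only if the install succeeded;
    # "exec" replaces the shell so the main process receives signals directly.
    return ["sh", "-c", f"{install} && exec {main}"]
```

For example, `command_with_pip_install(["facebook-business"], ["airflow", "scheduler"])` gives a command that installs the SDK every time the pod starts. That costs startup time and needs network access at start, which is part of why baking the packages into the image (option 1) is usually the more robust choice.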
I would recommend updating your Airflow Docker image to include the libraries you need.
If you plan to use lots of different libraries for specific DAGs then it may be worth creating multiple Docker images and then referencing them at a task level.
MyOperator(...,
    executor_config={
        "KubernetesExecutor": {"image": "myCustomDockerImage"}
    }
)
Reference: baseoperator.py
My server admin restricts the disk-space to about 50 Mb. The default gcloud install (with alpha) on linux takes about 150 Mb. I need to reduce the install size to fit my drive space.
I tried using pyinstaller (https://www.pyinstaller.org/) on lib/gcloud.py, since bin/gcloud is a bash script. The resulting executable (in lib/dist) did not work.
I also tried to zip some of the libs (lib/surface, and some others) and added the resulting .zip files to sys.path in lib/gcloud.py. This should allow zipimport to use these zips while conserving disk space.
While this approach reduced the size to below 50 Mb, and works quite well for some gcloud options, it does not work for cloud-shell.
I noticed that there are a lot of .pyc files along with the .py files. For example, both gcloud.py and gcloud.pyc are present in lib/. This seems like a waste, so I ran python -m compileall . in the root folder followed by find . -iname '*.py' -delete. This also did not work, but it did reduce the disk space below 40 Mb.
I am most interested in using gcloud alpha cloud-shell, and not the other apis. Using the above approach (.zip files appended to sys.path) gives this error with gcloud alpha cloud-shell ssh/scp
ERROR: gcloud crashed (IOError): [Errno 20] Not a directory
A zipfile of a fully functional gcloud installation directory comes to under 20 Mb. So there has got to be a way to fit it in 50 Mb. Any ideas?
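One plausible source of an "[Errno 20] Not a directory" with the zip approach (a guess, not something confirmed against the gcloud source) is code that builds file paths relative to a module's location and opens them directly. Imports from a zip work, but such paths point inside the archive, which the OS sees as a plain file. A small self-contained sketch:

```python
import importlib
import os
import sys
import tempfile
import zipfile


def demo_zip_import():
    """Import a package from a zip, then try to open a data file 'inside' it."""
    zip_path = os.path.join(tempfile.mkdtemp(), "libs.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("zipped_pkg/__init__.py", "VALUE = 42\n")
        archive.writestr("zipped_pkg/data.txt", "hello\n")

    sys.path.insert(0, zip_path)  # zipimport handles this path entry
    module = importlib.import_module("zipped_pkg")

    # Code that derives paths from __file__ now points inside the archive.
    data_path = os.path.join(os.path.dirname(module.__file__), "data.txt")
    try:
        with open(data_path) as handle:
            handle.read()
        error = None
    except OSError as exc:  # libs.zip is a file, not a directory
        error = exc
    return module.VALUE, error
```

The import succeeds, but the open() fails. If cloud-shell reads data files this way, the libraries involved would have to stay unzipped on disk.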
UPDATE:
If you are comfortable with using the oauth2 workflow, see joffre's answer below.
Personally, I find it quite troublesome to use oauth2. In fact, one of the major benefits of the gcloud CLI for me is that once gcloud init is done, all auth problems are solved.
In the byte-compile approach I tried earlier, __init__.py files were also getting removed. *.json files also seem not essential to functionality (they might have help strings though)
python -m compileall .
find . -iname '*.py' -not -iname '__init__.py' -delete
find . -iname '*.json' -delete
This brings down the total install size to 40-45 Mb.
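The same steps can be done from Python, which also makes it easy to avoid deleting a source file that has no compiled counterpart. This is a sketch written for Python 3, where the legacy flag is needed so that the .pyc files land next to the sources instead of in __pycache__; the function name is made up for illustration.

```python
import compileall
import os


def strip_sources(root):
    """Byte-compile a tree, then delete every .py except __init__.py files."""
    # legacy=True writes foo.pyc next to foo.py (not into __pycache__),
    # which is the layout Python 3 needs to import without the source file.
    compileall.compile_dir(root, quiet=1, legacy=True)
    removed = []
    for directory, _subdirs, files in os.walk(root):
        for filename in files:
            if filename.endswith(".py") and filename != "__init__.py":
                source = os.path.join(directory, filename)
                # Only delete sources that really have a compiled twin.
                if os.path.exists(source + "c"):
                    os.remove(source)
                    removed.append(source)
    return removed
```

The .pyc files are tied to the interpreter version that produced them, so the tree has to be compiled with the same Python that will run it.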
Note that it is also possible to do the reverse, i.e. delete all *.pyc while keeping all *.py. This will also reduce disk space, but not by as much (since most *.pyc seem to be smaller than the corresponding *.py files).
You don't need the gcloud CLI to connect to your Cloud Shell.
If you run gcloud alpha cloud-shell ssh --log-http, you'll see what the tool is actually doing, so you can manually replicate this.
First, be sure that your SSH public key is in the environment, which can be done through the API (it does not even need to be done from the server you're trying to connect from).
Then, you have to start the environment, which can be done through this API endpoint, and you have to wait until the returned operation is finished, which can be done through this other API endpoint. Note that this can be done from your environment (which would require oauth authentication), or you can do this from an external service (e.g. program a Cloud Function to start the Cloud Shell environment when you call a specific endpoint).
Finally, once the environment is started, you need to get the information for connecting to your Cloud Shell instance through this API endpoint (which, again, does not even need to be done from the server you're connecting from), and then connect through SSH from the server with that information.
This will limit the tooling required on your server to a simple SSH client (which is likely already pre-installed).
With the links I provided, you can do all of this manually and check if this properly works. However, this is tedious to do manually, so I would likely create a Cloud Function that makes all the required API calls and returns the connection information in the body of the response. I might even be lazy enough to have the function return the explicit ssh command that needs to be run, so once I'm connected to the server, I just need to run curl <my_function_URL>|sh and everything will work.
If you try to do something like this, be extremely sure to verify that this is secure on your setup (so, be sure to not add unneeded keys on your Cloud Shell environment), since I'm just writing this off the top of my head, and having an exposed Cloud Function feels somewhat insecure (anybody calling that Cloud Function would at least know the IP of your Cloud Shell environment). But at least, this is an idea that you could explore.
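The last step, turning the connection information into an ssh command, could look something like this. It is a sketch only: the field names sshUsername, sshHost and sshPort are an assumption based on the Environment resource of the Cloud Shell REST API and should be checked against the actual response, and the helper name is made up.

```python
import shlex


def ssh_command(environment, key_path=None):
    """Format an ssh command from a Cloud Shell environment description."""
    # Field names are assumed; verify them against the real API response.
    target = f"{environment['sshUsername']}@{environment['sshHost']}"
    command = ["ssh", "-p", str(environment["sshPort"])]
    if key_path:
        command += ["-i", key_path]
    command.append(target)
    # Quote each part so the string is safe to hand to a shell.
    return " ".join(shlex.quote(part) for part in command)
```

A function returning this string is what the curl ... | sh idea would consume, so the quoting matters: whatever comes back gets executed by the shell as-is.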