Minimizing disk-space of gcloud CLI installation - python

My server admin restricts the disk space to about 50 MB. The default gcloud install (with alpha) on Linux takes about 150 MB. I need to reduce the install size to fit within that limit.
I tried using PyInstaller (https://www.pyinstaller.org/) on lib/gcloud.py, since bin/gcloud is a bash script. The resulting executable (in lib/dist) did not work.
I also tried to zip some of the libs (lib/surface, and some others) and added the resulting .zip files to sys.path in lib/gcloud.py. This should allow zipimport to use these zips while conserving disk space.
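A minimal sketch of that tweak near the top of lib/gcloud.py (the archive names are placeholders for whichever libs you zipped, and the zips are assumed to sit next to gcloud.py):

import os
import sys

_LIB_DIR = os.path.dirname(os.path.abspath(__file__))

# Prepend the zipped libraries so zipimport finds them before
# (or instead of) the unpacked directories.
for _zip_name in ('surface.zip', 'third_party.zip'):  # placeholder names
    _zip_path = os.path.join(_LIB_DIR, _zip_name)
    if os.path.exists(_zip_path):
        sys.path.insert(0, _zip_path)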
While this approach reduced the size to below 50 MB and works quite well for some gcloud options, it does not work for cloud-shell.
I noticed that there are a lot of .pyc files alongside the .py files; for example, both gcloud.py and gcloud.pyc are present in lib/. This seems like a waste, so I ran python -m compileall . in the root folder, followed by find . -iname '*.py' -delete. This also did not work, but it did reduce the disk usage to below 40 MB.
I am most interested in using gcloud alpha cloud-shell, and not the other apis. Using the above approach (.zip files appended to sys.path) gives this error with gcloud alpha cloud-shell ssh/scp
ERROR: gcloud crashed (IOError): [Errno 20] Not a directory
A zipfile of a fully functional gcloud installation directory comes to under 20 MB, so there has got to be a way to fit it in 50 MB. Any ideas?
UPDATE:
If you are comfortable with using the OAuth2 workflow, see joffre's answer below.
Personally, I find it quite troublesome to use OAuth2. In fact, one of the major benefits of the gcloud CLI for me is that once gcloud init is done, all auth problems are solved.
In the byte-compile approach I tried earlier, the __init__.py files were also getting deleted (which breaks package imports). The *.json files also don't seem essential to functionality (they might just hold help strings), so the refined commands are:
python -m compileall .
find . -iname '*.py' -not -iname '__init__.py' -delete
find . -iname '*.json' -delete
This brings the total install size down to 40-45 MB.
Note that it is also possible to do the reverse, i.e. delete all *.pyc files while keeping all *.py files. This also reduces disk space, but not by as much (since most *.pyc files seem to be smaller than the corresponding *.py files).

You don't need the gcloud CLI to connect to your Cloud Shell.
If you run gcloud alpha cloud-shell ssh --log-http, you'll see what the tool is actually doing, so you can manually replicate this.
First, be sure that your SSH public key is in the environment, which can be done through the API (it does not even need to be done from the server you're trying to connect from).
Then, you have to start the environment, which can be done through this API endpoint, and you have to wait until the returned operation is finished, which can be done through this other API endpoint. Note that this can be done from your environment (which would require oauth authentication), or you can do this from an external service (e.g. program a Cloud Function to start the Cloud Shell environment when you call a specific endpoint).
Finally, once the environment is started, you need to get the information for connecting to your Cloud Shell instance through this API endpoint (which, again, does not even need to be done from the server you're connecting from), and finally connect via SSH from the server with that information.
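For illustration, here is a rough Python sketch of that sequence. It is only a sketch: I'm assuming the Cloud Shell API v1 methods users.environments.addPublicKey, start and get, the operations.get polling endpoint, the sshUsername/sshHost/sshPort response fields, and an OAuth2 access token you obtained separately; double-check all of this against the current API reference.

import time
import requests

API = 'https://cloudshell.googleapis.com/v1'
ENV = 'users/me/environments/default'
ACCESS_TOKEN = '...'                        # obtained through your own OAuth2 flow
PUBLIC_KEY = 'ssh-rsa AAAA... user@host'    # the key you want authorized
HEADERS = {'Authorization': 'Bearer ' + ACCESS_TOKEN}

# 1. Register the SSH public key on the environment (assumed request shape).
requests.post('{}/{}:addPublicKey'.format(API, ENV), headers=HEADERS,
              json={'key': PUBLIC_KEY}).raise_for_status()

# 2. Start the environment and poll the long-running operation until it is done.
op = requests.post('{}/{}:start'.format(API, ENV), headers=HEADERS, json={}).json()
while not op.get('done'):
    time.sleep(5)
    op = requests.get('{}/{}'.format(API, op['name']), headers=HEADERS).json()

# 3. Fetch the connection details and build the ssh command.
env = requests.get('{}/{}'.format(API, ENV), headers=HEADERS).json()
print('ssh -p {} {}@{}'.format(env['sshPort'], env['sshUsername'], env['sshHost']))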
This will limit the tooling required on your server to a simple SSH client (which is likely already pre-installed).
With the links I provided, you can do all of this manually and check that it works properly. However, this is tedious to do manually, so I would likely create a Cloud Function that makes all the required API calls and returns the connection information in the response body. I might even be lazy enough to have the function return the explicit ssh command that needs to be run, so once I'm connected to the server, I just need to run curl <my_function_URL>|sh and everything will work.
If you try to do something like this, be extremely sure to verify that this is secure on your setup (so, be sure to not add unneeded keys on your Cloud Shell environment), since I'm just writing this from the top of my head, and having an exposed Cloud Function feels somewhat insecure (anybody calling that Cloud Function would at least know the IP of your Cloud Shell environment). But at least, this is an idea that you could explore.

Related

Spleeter on Google Cloud Functions

I'm trying to run the Spleeter library on Google Cloud Functions.
I have a problem: the separate_to_file function creates pretrained_models files in the root folder while executing.
The only directory I can write something is /tmp.
Is there any way to change directory of pretrained models?
You can set the MODEL_DIRECTORY environment variable to the path of the directory you want models to be written into before running Spleeter. Please note that most models are quite heavy and may require a lot of storage.
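In a Cloud Function that could look roughly like the following sketch. It assumes MODEL_DIRECTORY is the variable Spleeter reads (as stated above), that /tmp is large enough for the model, and the file paths and entry-point name are placeholders.

import os

# Point Spleeter's model cache at /tmp, the only writable directory
# in the Cloud Functions runtime; set it before Spleeter is imported.
os.environ['MODEL_DIRECTORY'] = '/tmp/pretrained_models'

from spleeter.separator import Separator

def split_audio(request):  # hypothetical HTTP-triggered entry point
    separator = Separator('spleeter:2stems')
    separator.separate_to_file('/tmp/input.mp3', '/tmp/output')
    return 'done'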
I was doing the same thing. Fetching the models can be solved by pointing to the /tmp directory, as already pointed out.
Unfortunately, that wasn't the end of it. Spleeter depends on Linux binaries that aren't included in the serverless environment. I solved this by deploying a Docker image instead of a plain script.
Then there was the issue that Spleeter consumes a lot of memory, especially for 4-stem and 5-stem splits. Google Cloud Functions doesn't offer enough RAM; AWS Lambda offers 10 GB, which is enough to split a regular, radio-friendly song.

How to organize shared libraries with docker and monorepo

What I have
I have 2 python apps that share a few bits of code, enough that I am trying to isolate the shared parts into modules/packages/libraries (I'm keeping the term vague on purpose, as I am not sure what the solution is). All my code is in a monorepo, because I am hoping to overcome some of the annoyances of managing more repos than we have team members.
Currently my file layout looks like:
+ myproject
+ appA
| + python backend A
| + js frontend
+ appB
| + B stuff
+ libs
+ lib1
+ lib2
Both appA and appB use lib1 and lib2 (they are essentially data models that abstract away the shared database). appA is a webapp with several components, not all of which are Python. It is deployed as a Docker stack that involves a bunch of containers.
I manage my dependencies with poetry to ensure reproducible builds, etc. Each Python component (appA, appB...) has its own pyproject.toml file, virtual env, etc.
appB is deployed separately.
All development is on linux, if it makes any difference.
What I need
I am looking for a clean way to deal with the libs:
development for appA is done in a local docker-compose setup. The backend auto-reloads on file changes (using a docker volume), and I would like it to happen for changes in the libs too.
development for appB is simpler, but is moving to docker so the problem will be the same.
What I've tried
My initial "solution" was to copy the libs folder over to a temporary location for development in appA. It works for imports, but it's messy as soon as I want to change the libs code (which is still quite often), as I need to change the original file, copy over, rebuild the container.
I tried symlinking the libs into the backend's docker environment, but symlinks don't seem to work well with docker (it did not seem to follow the link, so the files don't end up in the docker image, unless I essentially copy the files inside the docker build context, which defeats the purpose of the link.)
I have tried packaging each lib into a python package, and install them via poetry add ../../libs/lib1 which doesn't really work inside docker because the paths don't match, and then I'm back to the symlink issue.
I am sure there is a clean way to do this, but I can't figure it out. I know I could break up the repo into smaller ones and install dependencies, but development would still cause problems inside docker, as I would still need to rebuild the container each time I change the lib files, so I would rather keep the monorepo.
If you are using docker-compose anyway you could use volumes to mount the local libs in your container and be able to edit them in your host system and the container. Not super fancy, but that should work, right?
@ckaserer your suggestion seems to work, indeed. In short, in the Dockerfiles I do COPY ../libs/lib1 /app/lib1, and then for local development I mount ../libs/lib1 onto /app/lib1. That gives me the behavior I was looking for. I use a split docker-compose file for this. The setup causes a few issues with various tools needing some extra config so they know the libs are part of the code base, but nothing impossible. Thanks for the idea!
So even though it's not an ideal solution, locally mounting over the app and lib directories works on Linux systems.
FYI: on Windows hosts you might run into trouble if you want to watch for file changes, as file-change events are not propagated from a Windows host to a Linux container.

securely executing untrusted python code on server

I have a server and a front end. I would like to get Python code from the user in the front end and execute it securely on the backend.
I read this article explaining the problematic aspects of each approach - e.g. the PyPy sandbox, Docker, etc.
I can define the inputs the code needs and the outputs, and can put all of them in a directory. What's important to me is that this code should not be able to hurt my filesystem or cause disk, memory, or CPU exhaustion (I need to be able to set timeouts, and to prevent it from accessing files that are not devoted to it).
What is the best practice here? Docker? Kubernetes? Is there a python module for that?
Thanks.
You can run the code in a Python Docker container with a volume mount. The volume mount maps a directory on the local system holding the code, which is then also available inside the container. By doing this you isolate the user-supplied code to the container while it runs.
Scan your Python container against the CIS Docker Benchmark for better security.
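A rough sketch of that setup, launching the container with resource limits and no network (the image name, limits, and paths are illustrative; review each flag against your own threat model):

import subprocess

def run_untrusted(code_dir, timeout_s=30):
    """Run /sandbox/main.py from code_dir in a throwaway, locked-down container."""
    cmd = [
        'docker', 'run', '--rm',
        '--network', 'none',                    # no network access
        '--memory', '256m',                     # cap RAM
        '--cpus', '0.5',                        # cap CPU
        '--pids-limit', '64',                   # limit process count (fork bombs)
        '--read-only',                          # read-only root filesystem
        '-v', '{}:/sandbox'.format(code_dir),   # only this directory is exposed
        'python:3.9-slim',                      # illustrative base image
        'python', '/sandbox/main.py',
    ]
    return subprocess.run(cmd, capture_output=True, timeout=timeout_s)

Note that subprocess.run raises TimeoutExpired when the timeout is hit but does not stop the container itself, so in practice you would also give the container a name and docker kill it on timeout.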

environment variables applied during elastic beanstalk deploy

My basic question: How would I set an environment variable that will be in effect during the Elastic Beanstalk deploy process?
I am not talking about setting environment variables during deployment that will be accessible by my application after it is deployed, I want to set environment variables that will modify a specific behavior of Elastic Beanstalk's build scripts.
To be clear - I generally think this is a bad idea, but it might be OK in this case so I am trying this out as an experiment. Here is some background about why I am looking into this, and why I think it might be OK:
I am in the process of transferring a server from AWS in the US to AWS in China, and am finding that server deploys fail between 50% ~ 100% of the time, depending on the day. This is a major pain during development, but I am primarily concerned about how I am going to make this work in production.
This is an Amazon Linux server running Python 2.7, and the logs indicate that the failures are mainly Read Timeout errors, with a few Connection Reset by Peer errors thrown in once in a while, all generated by pip install while attempting to download packages from PyPI. To verify this, I ssh'd into my instances to manually install a few packages, and on a small sample I see similar failure rates. Note that this is pretty common when trying to access content on the other side of China's GFW.
So, I wrote a script that uses pip to download the packages to my local machine, then uses aws s3 sync to copy them to an S3 bucket located in the same region as my server. This eliminates the need to cross the GFW while deploying.
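The local side of that can be as small as the following sketch (bucket name and paths are illustrative; pip download needs a reasonably recent pip, and replaces the older pip install --download):

import subprocess

# Download wheels/sdists for everything in requirements.txt into ./packages,
# then mirror that directory to the S3 bucket the instances will sync from.
subprocess.check_call(['pip', 'download', '-r', 'requirements.txt', '-d', 'packages'])
subprocess.check_call(['aws', 's3', 'sync', 'packages', 's3://bucket/packages'])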
My original plan was to add an .ebextensions config that copies the packages from S3 (via aws s3 cp) into the pip cache, but (unless I missed something) this somewhat surprisingly doesn't appear to be straightforward.
So, as plan B, I am syncing the packages into a local directory on the instance. This is working well, but I can't get pip install to pull packages from the local directory rather than downloading them from PyPI.
Following the pip documentation, I expected that pointing the PIP_FIND_LINKS environment variable at my package directory would have pip "naturally" pull packages from my directory rather than from PyPI. That would make the change transparent to the EB build scripts, which is why I thought this might be a reasonable solution.
So far I have tried:
1) A command that exports PIP_FIND_LINKS=/path/to/package, with no luck. I assumed this was due to the deploy step being called from a different session, so I then tried:
2) A command that (in addition to the previous export) appends export PIP_FIND_LINKS=/path/to/package to ~/.profile, in an attempt to have this apply to any new sessions.
I have tried issuing the commands as both ec2-user and root, and neither works.
Rather than keep poking a stick at this, I was hoping that someone with a bit more experience with the nuances of EB, pip, etc might be able to provide some guidance.
After some thought I decided that a pip config file should be a more reliable solution than environment variables.
This turned out to be easy to implement with .ebextensions. I first create the download script, then create the config file directly in the virtualenv folder:
files:
  "/home/ec2-user/download_packages.sh":
    mode: "000500"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      package_dir=/path/to/packages
      mkdir -p $package_dir
      aws s3 sync s3://bucket/packages $package_dir
  "/opt/python/run/venv/pip.conf":
    mode: "000755"
    owner: root
    group: root
    content: |
      [install]
      find-links = file:///path/to/packages
      no-index=false
Finally, a command is used to call the script that we just created:
commands:
  03_download_packages:
    command: bash /home/ec2-user/download_packages.sh
One potential issue is that pip bypasses the local package directory and still downloads the packages that come from our private git repo, so there is still some potential for timeout errors, but these represent just a small fraction of the packages that need to be installed, so it should be workable.
Still unsure if this will be a long-term solution, but it is very simple and (after just one day of testing...) failure rates have fallen from 50% ~ 100% to 0%.

What source bundle file types are supported?

I'm deploying an app to AWS Elastic Beanstalk using the API:
https://elasticbeanstalk.us-east-1.amazon.com/?ApplicationName=SampleApp
&SourceBundle.S3Bucket=amazonaws.com
&SourceBundle.S3Key=sample.war
...
My impression from reading around a bit is that Java deployments use .war, that .zip files are supported (docs), and that one can use .git (but only with PHP, or by using eb? doc).
Can I use the API to create an application version from a .git for a Python app? Or are zips the only type supported?
(Alternatively, can I git push to AWS without using the commandline tools?)
There are two ways to deploy to AWS Elastic Beanstalk:
The API backend, where the source bundle is basically a .zip file referenced from S3. When deploying, the instance will unpack it and run some custom scripts (which you can override from your AMI, or via Custom Configuration Files, which are the recommended way). Note that in order to create and deploy a new version in an AWS Elastic Beanstalk environment, you need three calls: upload to S3, CreateApplicationVersion, and UpdateEnvironment.
The git endpoint, which works like this:
You install the AWS Elastic Beanstalk DevTools, and run a setup script on your git repo
When run, the setup script patches your .git/config in order to support git aws.push and, in particular, git aws.remote (which is not documented)
git aws.push simply takes your keys, builds a custom URL (git aws.remote), and does a git push -f master
Once AWS receives this (the URL is basically <api>/<app>/<commitid>(/<envname>)), it creates the .zip file in S3 (from the commit contents), then the application version on <app> for <commitid>, and if <envname> is present it also issues an UpdateEnvironment call. Your AWS IDs are hashed and embedded into the URL just like in all AWS calls, but sent as username/password auth tokens.
(full reference docs)
I ported that to a Maven plugin a few months ago, and this file shows how it is done in plain Java. It actually covers a lot of code (since it builds a custom git repo using jgit, calculates the hashes, and pushes into it).
I'm strongly considering backporting it as an Ant task, or perhaps simply making it work without a pom.xml file present, so users only use Maven to do the deployment.
Historically, only the first method was supported, while the second has grown in importance. Since the second is actually far easier (with beanstalk-maven-plugin you have to call three different methods, while a simple git push does all three), we're supporting git-based deployments, and have even published an archetype for it (you can see a sample project here, especially the README.md).
(By the way, if you're using .war files, my Elastic Beanstalk plugin supports both ways, and we actually favor git, since it allows us to do incremental deployments.)
So, you still wanna implement it?
There are three files I suggest you read:
FastDeployMojo.java is the main façade
RequestSigner does the real magic
This is a testcase for RequestSigner
Wanna do it in:
Python? I'd look for Dulwich
C#? The PowerShell version is based on it
Ruby? The Linux version is based on it
Java? Use mine, it uses jgit
