I've just seen that my web applications docker image is enormous. A 600-MB reason is the packages I install for it. The biggest single offender is botocore with 77.7 MB.
Apparently this is known behavior: https://github.com/boto/botocore/issues/1629
Is it possible to redue that size?
Analysis
The tar.gz distribution is just 10.8 MB: https://pypi.org/project/botocore/#files
75MB are in the data directory
For every single AWS service, there seem to be multiple folders (some kind of versioning?) and a service-2.json
The service-2.json files probably use most of the space. They are not minified and they contain a lot of information that seems not to be necessary for running a production system (e.g. description).
Is there a way to either completely avoid botocore or in any other way reduce botocores size for the Docker image? (I'm only using S3)
Related
We have a number of Python Databricks jobs that all use the same underlying Wheel package to install their dependencies. Installing this Wheel package even with a node that has been idling in a Pool still takes 90 seconds.
Some of these jobs are very long-running so we would like to use Jobs computer clusters for the lower cost in DBUs.
Some of these jobs are much shorter-running (<10 seconds) where the 90 second install time seems more significant. We have been considering using a hot cluster (All-Purpose Compute) for these shorter jobs. We would like to avoid the extra cost of the All-Purpose Compute if possible.
Reading the Databricks documentation suggests that the Idle instances in the Pool are reserved for us but not costing us DBUs. Is there a way for us to pre-install the required libraries on our Idle instances so that when a job comes through we are able to immediately start processing it?
Is there an alternate approach that can fulfill a similar use case?
You can't install libraries directly into nodes from pool, because the actual code is executed in the Docker container corresponding to Databricks Runtime. There are several ways to speedup installation of the libraries:
Create your own Docker image with all necessary libraries pre-installed, and pre-load Databricks Runtime version and your Docker image - this part couldn't be done via UI, so you need to use REST API (see description of preloaded_docker_images attribute), databrick-cli, or Databricks Terraform provider. The main disadvantage of custom Docker images is that some functionality isn't available out of box, for example, arbitrary files in Repos, web terminal, etc. (don't remember full list)
Put all necessary libraries and their dependencies onto DBFS and install them via cluster init script. It's very important that you collect binary dependencies, not packages only with the source code, so you won't need to compile them when installing. This could be done once:
for Python this could be done with pip download --prefer-binary lib1 lib2 ...
for Java/Scala you can use mvn dependency:get -Dartifact=<maven_coordinates>, that will download dependencies into ~/.m2/repository folder, from which you can copy jars to DBFS and in init script use cp /dbfs/.../jars/* /databricks/jars/ command
for R, it's slightly more complicated, but is also doable
I have built a python wrapper for some JAR binaries and I want to distribute it to PyPi. The problem is that the size of these JARs is quite large. They exceed the PyPi limit size of 60MB (the current size is about 200MB or more). What are the best practices for packaging in such cases? I got the following idea but do not know if there is a better practice.
I will save these binaries somewhere and download them with a script in the main init function in the wrapper code or during the installation step. This solution seems to be quite good but Could you recommend a good repository to save these binaries? I may suggest DropBox and Google Drive but I feel that they do not fit for this case!
By the way, is it possible to download files during the installation step?
Thanks for your help,
You're on the right track, move the dependencies out of your package and download them on installation / first use (just be sure you include a progress indicator of some kind so people know what is happening, since it may take some minutes for dependencies that large to be downloaded and you don't want them to think it's hanging.
I'd avoid things like Dropbox or Google Drive (especially Drive) since they are notoriously slow as download mirrors. Instead, try something like AWS S3 or Google Cloud Storage. Wrap CloudFront around it as a CDN too if you want improved latency regionally.
Hope this helps!
I need some advice.
I trained an image classifier using Tensorflow and wanted to deploy it to AWS Lambda using serverless. The directory includes the model, some python modules including tensorflow and numpy, and python code. The size of the complete folder before unzipping is 340 MB, which gets rejected by AWS lambda with an error message saying "The unzipped state must be smaller than 262144000 bytes".
How should I approach this? Can I not deploy packages like these on AWS Lambda?
Note: In the requirements.txt file, there are 2 modules listed including numpy and tensorflow. (Tensorflow is a big module)
I know I am answering it very late .. just putting it here for reference for other people..
I did the following things -
Delete /external/* /tensorflow/contrib/* /tensorflow/include/unsupported/* files as suggested here.
Strip all .so files especially two files in site-packages/numpy/core - _multiarray_umath.cpython-36m-x86_64-linux-gnu.so and _multiarray_tests.cpython-36m-x86_64-linux-gnu.so. Strip considerably reduces their size.
You can put your model in S3 bucket and download it at runtime. This will reduce the size of the zip. This is explained in detail here.
If this does not work then there are some additional things that can be done like removing pyc files etc as mentioned here
You can maybe use the ephemeral disk capacity, (/tmp) that have a limit of 512Mb, but in your case, memory will still be an issue.
The best choice can be to use an AWS batch, if serverless does not manage it, you can even keep a lambda to trigger your batch
The best way to do it would be to use the Serverless Framework as outlined in this article. It helps to zip them using a docker image which mimics Amazon's linux environment. Additionally, it automatically uses S3 as the code repository for your Lambda which increases the size limit. The article provided is an extremely helpful guide and is the same way that developers use tensorflow and other large libraries on AWS.
If you're still running into the 250MB size limit, you can try to follow this article which uses the same python-requirements-plugin as the previous article, but with the option -slim: true. This will help you to optimally compress your packages by removing unnecessary files from them, which allows you to decrease your package size before AND after unzipping.
I'd like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I'd like to achieve is this:
User runs pip install dataset
pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
Later, when the user imports my module, the module automatically loads the data from the specific location.
This question is about bullets 2 and 3. Is there a way to do this with setuptools?
As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
def download_data(url='http://...'):
# Download; extract data to disk.
# Raise an exception if the link is bad, or we can't connect, etc.
def load_data():
if not os.path.exists(DATA_DIR):
download_data()
data = read_data_from_disk(DATA_DIR)
return data
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior in the imageio module with respect to downloading necessary decoders at runtime, rather than making the user manage the external downloads themselves.
Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
Please do not do this.
The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:
The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
The data on the web might change, meaning that two people who install the same exact package get different results.
The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won't be able to use it.
The user could be behind an internet filter, and you might get a useless "this page is blocked" HTML file instead of the dataset you were expecting.
Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.
Python package installation states that it should never execute Python code in order to install Python packages. This means that you may not be able to download stuff during the installation process.
If you want to download some additional data, do it after you install the package , for example when you import your package you could download this data and cache it somewhere in order not to download it at every new import.
This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing to download external content at runtime.
The original problem is, that one cannot package arbitrary content into a Python package, if it exceeds the max. size limit of the package registry. This size limit effectively breaks up the relationship of the packaged Python code and the data it operates on. Suddenly things that belong together have to be separated and the package creator needs to take care about versioning and availability of external data. If the size limits are met, everything is installed at installation time and the discussion would be over here. I want to stress, that data & algorithms belong together and are normally installed at the same time, not at some later date. That's the whole point of package integrity. If you cannot install a package, because the external content cannot be downloaded, you want to know at installation time.
In the light of Docker & friends, downloading data at runtime makes a container non-reproducible and forces the download of the external content at each start of the container unless you additionally add the path where the data is downloaded to a Docker volume. But then you need to know where exactly this content is downloaded and the user/Dockerfile creator has to know more unnecessary details. There are more issues in using volumes in that regard.
Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. you need to fetch every time after a docker build.
Then again one could argue, that one should provide a function/executable script that downloads this external content and the user should execute this script directly after installation. Again the user of the package needs to know more than necessary, because someone or some commitee proclaims, executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installation of a package is factually the same as downloading the content directly inside a post-installation step, just more user-unfriendly. Thinking about how popular machine learning is today, the growing size of models and popularity of ML in the future, there will be a lot of scripts to be executed for just a handful of Python package dependencies for model downloads in the near future according to this argumentation.
The only time I see a benefit for an extra script, is when you can choose to download between several different versions of the external content, but then one intentionally involves the user into that decision.
But coming back to the runtime on-demand lazy model download, where the user doesn't need to be involved into executing an extra script: let's assume, the user packages the container, all tests pass successfully on the CI and he/she distributes it to Dockerhub or any other container registry and starts production. Nobody then wants the situation of random fails, because a successfully installed package intermittently downloads content from time to time e.g. after some maintainence task happens like cleaning up docker volumes or if distributing containers on new k8s nodes and the first request to a web app times out because external content is always fetched at startup. Or not fetched at all, because the external URL is in maintenance mode. That's a nightmare!
If it would be allowed to have reasonably sized Python packages, the whole problem would be much less of an issue. E.g. in contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700MB big and of course it's allowed to download external content at installation time.
I am using EC2 box with 4GB RAM and disk space of 8GB. I inspected my directories all of which were supposed to contain python scripts. I found that my /.git folder is 1.8GB in size. Most of the space was taken by ./.git/objects folder. Out of that too very few folders are in ~200MBs.Is it too much of a size? Can I just delete those folders? I am running out of disk space so cannot install more packages using pip.
I am 40 commits ahead on EC2. I don't want to push those commits because there is something in the history which is huge in size(some 100MB which I unknowingly committed) and it does not allow me to do so. I looked at the solution online and found that it is not so easy. Given the amount of time that I have I decided to edit scripts on my local machine and push those to git and then pull them into my remote machine.
You should take a look at this thread to find out what happened (possible duplicate): How to find/identify large files/commits in Git history?
Then take a look at BFG Repo Cleaner to clean up your repository.
You should avoid commiting binary files because even deleted, they will always be in the .git/objects folder, and you should never delete files from there yourself.