I've just seen that my web applications docker image is enormous. A 600-MB reason is the packages I install for it. The biggest single offender is botocore with 77.7 MB.
Apparently this is known behavior: https://github.com/boto/botocore/issues/1629
Is it possible to redue that size?
Analysis
The tar.gz distribution is just 10.8 MB: https://pypi.org/project/botocore/#files
75MB are in the data directory
For every single AWS service, there seem to be multiple folders (some kind of versioning?) and a service-2.json
The service-2.json files probably use most of the space. They are not minified and they contain a lot of information that seems not to be necessary for running a production system (e.g. description).
Is there a way to either completely avoid botocore or in any other way reduce botocores size for the Docker image? (I'm only using S3)
I've trained my model locally and now I want to use it in my Kubernetes cluster. Unfortunately, all the Docker images for Pytorch are 5+ GBs because they contain the scripts for training which I won't need now. I've created my own image which is only 3.5 GBs but still huge. Is there a slim Pytorch version for predictions? If not, which parts of the package can I safely remove and how?
No easy answer for Python version of PyTorch unfortunately (or at least none I’m aware of).
Python, in general, is not well-suited for Docker deployments as it carries over the dependencies (even if you don't need all of their functionality, imports are often at the top of the file making your aforementioned removal infeasible for projects of PyTorch size and complexity).
There is a way out though...
torchscript
Given your trained model you can convert it to traced/scripted version (see here). After you manage that:
Inference in other languages
Write your inference code in another language, either Java or C++(see here for more info).
I have only used C++, but you might get there easier with Java, I think.
Results
Managed to get PyTorch for CPU inference to roughly ~32MB, GPU would weight more and be way more complex though and would probably need ~1GB of CUDNN dependency itself.
C++ way
Please note torchlambda project is not currently maintained and I’m the creator, hopefully it gives you some tips at least.
See:
Dockerfile for the image build
CMake used for building
Docs for more info about compilation options etc.
C++ inference code
Additional notes:
It also uses AWS SDKs and you would have to remove them from at least these files
You don't need static compilation - it would help to reach the lowest possible (I could come up with) image size, but not strictly necessary (additional ‘100MB’ or so)
Final
Try Java first as it’s packaging is probably saner (although final image would probably be a little bigger)
The C++ way not tested for the newest PyTorch version and might be subject to change with basically any release
In general it takes A LOT of time and debugging, unfortunately.
I have built a python wrapper for some JAR binaries and I want to distribute it to PyPi. The problem is that the size of these JARs is quite large. They exceed the PyPi limit size of 60MB (the current size is about 200MB or more). What are the best practices for packaging in such cases? I got the following idea but do not know if there is a better practice.
I will save these binaries somewhere and download them with a script in the main init function in the wrapper code or during the installation step. This solution seems to be quite good but Could you recommend a good repository to save these binaries? I may suggest DropBox and Google Drive but I feel that they do not fit for this case!
By the way, is it possible to download files during the installation step?
Thanks for your help,
You're on the right track, move the dependencies out of your package and download them on installation / first use (just be sure you include a progress indicator of some kind so people know what is happening, since it may take some minutes for dependencies that large to be downloaded and you don't want them to think it's hanging.
I'd avoid things like Dropbox or Google Drive (especially Drive) since they are notoriously slow as download mirrors. Instead, try something like AWS S3 or Google Cloud Storage. Wrap CloudFront around it as a CDN too if you want improved latency regionally.
Hope this helps!
I have a python project with a lot of dependencies (around 30 or so python packages) that i want to deploy on aws using lambda function. Right now i have about 30 custom python packages in my VS solution that i import into the main function - there is a lot of code. What is the best way to build a deployment package and how would i go about doing this?
I watched a few tutorials but i am new to this, so im not sure exactly what concrete steps to take. If i use something like zappa and create a virtual environment how would i then get my project there and install all the dependencies and then zip the file?
Thanks so much, sorry for the stupid questions, i couldn't find a stackoverflow post that covered this
Just go to your python environment folder and found site-package folder (usually in /lib), choose all the dependencies you need and zip them with your code.
I guess it's the easiest way.
For example, I may need beautifulsoup and urllib for dependencies, just zip them (and their dependencies, if needed) with my code, then upload to AWS Lambda, that's all.
BTW, you can also see this gist to know whether the module you need can be directly import to AWS Lambda or not.
I'd like to create some ridiculously-easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some stuff already exists, but I want it to be even simpler.)
What I'd like to achieve is this:
User runs pip install dataset
pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data here isn't a big deal.)
Later, when the user imports my module, the module automatically loads the data from the specific location.
This question is about bullets 2 and 3. Is there a way to do this with setuptools?
As alluded to by Kevin, Python package installs should be completely reproducible, and any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.
Instead, to avoid burdening the user, consider downloading the data in a lazy way, upon load. Example:
def download_data(url='http://...'):
# Download; extract data to disk.
# Raise an exception if the link is bad, or we can't connect, etc.
def load_data():
if not os.path.exists(DATA_DIR):
download_data()
data = read_data_from_disk(DATA_DIR)
return data
We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to the behavior in the imageio module with respect to downloading necessary decoders at runtime, rather than making the user manage the external downloads themselves.
Note that the data does not reside in the python package itself, but is downloaded from somewhere else.
Please do not do this.
The whole point of Python packaging is to provide a completely deterministic, repeatable, and reusable means of installing exactly the same thing every time. Your proposal has the following problems at a minimum:
The end user might download your package on computer A, stick it on a thumb drive, and then install it on computer B which does not have internet.
The data on the web might change, meaning that two people who install the same exact package get different results.
The website that provides the data might cease to exist or unwisely change the URL, meaning people who still have the package won't be able to use it.
The user could be behind an internet filter, and you might get a useless "this page is blocked" HTML file instead of the dataset you were expecting.
Instead, you should either include your data with the package (using the package_data or data_files arguments to setup()), or provide a separate top-level function in your Python code to download the data manually when the user is ready to do so.
Python package installation states that it should never execute Python code in order to install Python packages. This means that you may not be able to download stuff during the installation process.
If you want to download some additional data, do it after you install the package , for example when you import your package you could download this data and cache it somewhere in order not to download it at every new import.
This question is rather old, but I want to add that downloading external data at installation time is of course much better than forcing to download external content at runtime.
The original problem is, that one cannot package arbitrary content into a Python package, if it exceeds the max. size limit of the package registry. This size limit effectively breaks up the relationship of the packaged Python code and the data it operates on. Suddenly things that belong together have to be separated and the package creator needs to take care about versioning and availability of external data. If the size limits are met, everything is installed at installation time and the discussion would be over here. I want to stress, that data & algorithms belong together and are normally installed at the same time, not at some later date. That's the whole point of package integrity. If you cannot install a package, because the external content cannot be downloaded, you want to know at installation time.
In the light of Docker & friends, downloading data at runtime makes a container non-reproducible and forces the download of the external content at each start of the container unless you additionally add the path where the data is downloaded to a Docker volume. But then you need to know where exactly this content is downloaded and the user/Dockerfile creator has to know more unnecessary details. There are more issues in using volumes in that regard.
Moreover, content fetched at runtime cannot be cached automatically by Docker, i.e. you need to fetch every time after a docker build.
Then again one could argue, that one should provide a function/executable script that downloads this external content and the user should execute this script directly after installation. Again the user of the package needs to know more than necessary, because someone or some commitee proclaims, executing Python code or downloading external content at installation time is not "recommended".
But forcing the user to run an extra script directly after installation of a package is factually the same as downloading the content directly inside a post-installation step, just more user-unfriendly. Thinking about how popular machine learning is today, the growing size of models and popularity of ML in the future, there will be a lot of scripts to be executed for just a handful of Python package dependencies for model downloads in the near future according to this argumentation.
The only time I see a benefit for an extra script, is when you can choose to download between several different versions of the external content, but then one intentionally involves the user into that decision.
But coming back to the runtime on-demand lazy model download, where the user doesn't need to be involved into executing an extra script: let's assume, the user packages the container, all tests pass successfully on the CI and he/she distributes it to Dockerhub or any other container registry and starts production. Nobody then wants the situation of random fails, because a successfully installed package intermittently downloads content from time to time e.g. after some maintainence task happens like cleaning up docker volumes or if distributing containers on new k8s nodes and the first request to a web app times out because external content is always fetched at startup. Or not fetched at all, because the external URL is in maintenance mode. That's a nightmare!
If it would be allowed to have reasonably sized Python packages, the whole problem would be much less of an issue. E.g. in contrast, the biggest Ruby gems (i.e. packages in the Ruby ecosystem) are over 700MB big and of course it's allowed to download external content at installation time.