I have built a Python wrapper for some JAR binaries and I want to distribute it on PyPI. The problem is that these JARs are quite large: they exceed PyPI's size limit of 60 MB (the current size is about 200 MB or more). What are the best practices for packaging in such cases? I have the following idea but do not know whether there is a better practice.
I would host these binaries somewhere and download them with a script, either in the main init function of the wrapper code or during the installation step. This solution seems quite good, but could you recommend a good place to host these binaries? I thought of Dropbox and Google Drive, but I feel they do not fit this case!
By the way, is it possible to download files during the installation step?
Thanks for your help,
You're on the right track: move the dependencies out of your package and download them on installation / first use. Just be sure to include a progress indicator of some kind so people know what is happening; dependencies that large can take several minutes to download, and you don't want users to think the process is hanging.
I'd avoid things like Dropbox or Google Drive (especially Drive), since they are notoriously slow as download mirrors. Instead, try something like AWS S3 or Google Cloud Storage. Put CloudFront in front of it as a CDN if you want better regional latency.
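As a rough sketch of the download-on-first-use approach, something like the following could live in your package and be called from your wrapper's init code (the URL, file names, and cache location here are placeholders, not a specific recommendation):

```python
import os
import urllib.request

# Placeholders: where the JAR is hosted (e.g. an S3/CloudFront URL) and where to cache it.
JAR_URL = "https://my-bucket.s3.amazonaws.com/mywrapper/mytool-1.0.0.jar"
CACHE_DIR = os.path.join(os.path.expanduser("~"), ".cache", "mywrapper")
JAR_PATH = os.path.join(CACHE_DIR, "mytool-1.0.0.jar")

def _progress(blocks_read, block_size, total_size):
    """Simple progress indicator so users know the download isn't hanging."""
    if total_size > 0:
        done = min(blocks_read * block_size, total_size)
        print(f"\rDownloading {os.path.basename(JAR_PATH)}: "
              f"{done / 1e6:.0f} / {total_size / 1e6:.0f} MB", end="", flush=True)

def ensure_jar():
    """Download the JAR on first use and cache it for subsequent imports."""
    if not os.path.exists(JAR_PATH):
        os.makedirs(CACHE_DIR, exist_ok=True)
        urllib.request.urlretrieve(JAR_URL, JAR_PATH, reporthook=_progress)
        print()  # finish the progress line
    return JAR_PATH
```

Downloading during installation (e.g. from a custom setup.py command) is technically possible, but such hooks don't run when users install a pre-built wheel, so downloading lazily at first use tends to be the more reliable option.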
Hope this helps!
I have made a Python package named 'Panclus', but I do not know whether anyone is using it. Please tell me how to see how many people have installed my Python package. Is it possible to get the names of those users?
From here:
PyPI does not display download statistics for a number of reasons:

Inefficient to make work with a Content Distribution Network (CDN): Download statistics change constantly. Including them in project pages, which are heavily cached, would require invalidating the cache more often, and reduce the overall effectiveness of the cache.

Highly inaccurate: A number of things prevent the download counts from being accurate, some of which include:
- pip's download cache (lowers download counts)
- Internal or unofficial mirrors (can both raise or lower download counts)
- Packages not hosted on PyPI (for comparison's sake)
- Unofficial scripts or attempts at download count inflation (raises download counts)
- Known historical data quality issues (lowers download counts)

Not particularly useful: Just because a project has been downloaded a lot doesn't mean it's good; similarly, just because a project hasn't been downloaded a lot doesn't mean it's bad!
Get more information about download statistics from this.
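If you do want approximate numbers, one option (a third-party service, not something PyPI itself exposes) is pypistats.org, which publishes a small JSON API. Assuming your package is named panclus on PyPI, a quick check could look like this:

```python
import json
import urllib.request

# pypistats.org aggregates PyPI download counts (last day / week / month).
url = "https://pypistats.org/api/packages/panclus/recent"

with urllib.request.urlopen(url) as resp:
    stats = json.load(resp)

print(stats["data"])  # e.g. {'last_day': ..., 'last_week': ..., 'last_month': ...}
```

As for getting the names of users: that is not possible; PyPI does not expose who downloads or installs a package.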
Apologies if this has been asked before; I've been reading about this issue for a while, and all the solutions seem to come down to the same three options.
For example, in this thread, Load local data files to Colaboratory, it is explained how to upload a file manually. That could work, but what if we had to share a file with 100 users? I believe that with that type of solution, they would all have to copy the Colaboratory project, as usual, and everyone would have to upload the same file and then proceed to use the code once they have the required CSV.
Am I missing something? Is there any way to share both the project and a number of files (even if that implies a longer load time, since the file has to be downloaded for every person)? I would like to know if there is an automatic way to do this.
I am particularly interested in solutions that do not involve auth tokens, since the people who would receive the shared project are not technologically savvy enough for the task.
If the data is not secret, Git may be the best solution: upload the CSV files to GitHub, then use git clone in your Colab notebook.
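A minimal sketch of what that could look like in a Colab cell, assuming a public repository and file name (both placeholders here):

```python
# Clone the repository that holds the shared CSV files (runs once per Colab session).
!git clone https://github.com/your-username/your-data-repo.git

import pandas as pd

# Read one of the cloned files; adjust the path to match your repository layout.
df = pd.read_csv("your-data-repo/data.csv")
df.head()
```

Since the clone happens inside the notebook itself, everyone you share the notebook with gets the data automatically, with no manual uploads and no auth tokens (for a public repository).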
I have a wheel built on MS Windows in a very restricted environment (no internet connection). I can copy it to my machine running Linux. Then I'd like to upload it to a private PyPI.
I don't want to use twine. I have had too many bad experiences with Python infrastructure tools, so I would like to avoid them as much as possible; but if that is not reason enough for you, think of it as a "learning experience": I just really want to know what API I need to use in order to put a file on a PyPI server.
To spare you some effort: I also read https://pypiserver.readthedocs.io/en/latest/, and there's no useful info there either.
The only thing I could find in terms of documentation is this: https://www.python.org/dev/peps/pep-0503/, which is useless for my case.
This is the closest I've gotten so far: https://github.com/python/cpython/blob/master/Lib/distutils/command/upload.py#L92, though it still leaves a lot to be desired, e.g. which fields are actually necessary and what restrictions apply to their contents.
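For concreteness, my current best guess based on that upload.py is roughly the following: a multipart/form-data POST with HTTP Basic auth, using the field names distutils sends. The URL, credentials, and metadata values below are placeholders, and I don't know which of these fields a private server (e.g. pypiserver) actually requires:

```python
import hashlib
from pathlib import Path

import requests

wheel = Path("mypackage-1.0.0-py3-none-any.whl")  # placeholder wheel file
content = wheel.read_bytes()

# Field names mirror what distutils' upload command sends.
data = {
    ":action": "file_upload",
    "protocol_version": "1",
    "name": "mypackage",      # project name (placeholder)
    "version": "1.0.0",       # release version (placeholder)
    "filetype": "bdist_wheel",
    "pyversion": "py3",
    "metadata_version": "2.1",
    "md5_digest": hashlib.md5(content).hexdigest(),
}

resp = requests.post(
    "https://pypi.example.com/",               # private index upload URL (placeholder)
    data=data,
    files={"content": (wheel.name, content)},  # the wheel itself
    auth=("username", "password"),             # HTTP Basic auth (placeholders)
)
resp.raise_for_status()
```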
I need some advice.
I trained an image classifier using TensorFlow and wanted to deploy it to AWS Lambda using Serverless. The directory includes the model, some Python modules including tensorflow and numpy, and my Python code. The size of the complete folder before zipping is about 340 MB, which gets rejected by AWS Lambda with an error message saying "The unzipped state must be smaller than 262144000 bytes".
How should I approach this? Can I not deploy packages like these on AWS Lambda?
Note: the requirements.txt file lists two modules, numpy and tensorflow (and TensorFlow is a big module).
I know I am answering this very late; just putting it here for reference for other people.
I did the following things:
Delete the /external/*, /tensorflow/contrib/*, and /tensorflow/include/unsupported/* files, as suggested here.
Strip all .so files, especially the two files in site-packages/numpy/core: _multiarray_umath.cpython-36m-x86_64-linux-gnu.so and _multiarray_tests.cpython-36m-x86_64-linux-gnu.so. Stripping considerably reduces their size.
Put your model in an S3 bucket and download it at runtime; this reduces the size of the zip (see the sketch after this list). This is explained in detail here.
If this does not work, there are some additional things that can be done, like removing .pyc files, as mentioned here.
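For the S3 step, a minimal sketch of loading the model into Lambda's writable /tmp directory at runtime could look like this (the bucket and key names are placeholders; boto3 is available in the Lambda Python runtime):

```python
import os

import boto3

# Placeholders: replace with your bucket and the key of your saved model.
MODEL_BUCKET = "my-model-bucket"
MODEL_KEY = "models/image_classifier.h5"
LOCAL_MODEL_PATH = "/tmp/image_classifier.h5"  # /tmp is the only writable path in Lambda

s3 = boto3.client("s3")

def fetch_model():
    """Download the model once per container; warm invocations reuse the cached file."""
    if not os.path.exists(LOCAL_MODEL_PATH):
        s3.download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_MODEL_PATH)
    return LOCAL_MODEL_PATH

def handler(event, context):
    model_path = fetch_model()
    # Load the model with TensorFlow and run inference here.
    return {"model_path": model_path}
```

This keeps the model itself out of the deployment zip entirely, so only the code and libraries count toward the size limit.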
You could maybe use the ephemeral disk capacity (/tmp), which has a limit of 512 MB, but in your case memory will still be an issue.
The best choice may be to use AWS Batch; if Serverless does not manage it, you can even keep a Lambda to trigger your batch job.
The best way to do it would be to use the Serverless Framework, as outlined in this article. It zips your dependencies using a Docker image that mimics Amazon's Linux environment. Additionally, it automatically uses S3 as the code repository for your Lambda, which increases the size limit. The article provided is an extremely helpful guide and describes the same way that developers use TensorFlow and other large libraries on AWS.
If you're still running into the 250 MB size limit, you can try to follow this article, which uses the same python-requirements plugin as the previous article, but with the option slim: true. This helps you compress your packages optimally by removing unnecessary files from them, which decreases your package size both before and after unzipping.
I have a number of Python packages in GitHub repositories and it would be really great to have these available on PyPI. I know I could do these releases manually (update the version number, perhaps update a changelog, tag the release in GitHub, get the download URL from GitHub, update PyPI with the release, etc.), but I keep thinking that there must be a script/utility somewhere to make this a single-command process.
I am not massively familiar with the Python packaging process, so perhaps I am coming at this from the wrong angle. I just don't think I can be the first one to have the idea of making this whole process a lot easier.
Edit: As there seems to be some confusion about what I am asking for: are there any tools that make releasing Python packages to PyPI a faster and more streamlined process?
I have tried searching around but have yet to find anything.
OK, I really don't know if anyone else has had this problem/concern, but I had an itch I needed to scratch, so I made this:
http://seed.readthedocs.org
I wouldn't be surprised if there is something out there already that does this better, but for now this is what I'll use :)
There is changes, a tool that makes publishing to PyPI a single step. It looks quite similar to seed.
Anyway, it would be nice if PyPI could just check whether there is a new tagged release on GitHub and release it on PyPI.