How to organize a Python project with pickle files?

I am coming from a Java background and am completely new to Python.
Now I have a Python project that consists of a few Python scripts and some pickle files stored in Git. The pickle files are serialized sklearn models.
I wonder how to organize this project. I think we should not store the pickle files in Git; we should probably store them as binary dependencies somewhere else.
Does that make sense? What is a common way to store binary dependencies of Python projects?

Git is just fine with binary data; many projects store images in their repos, for example.
The rule of thumb is to decide whether your binary files are source material, an external dependency, or an intermediate build artifact. Of course, there are no strict rules, so just decide what feels right. Here are my suggestions:
If they're (reproducibly) generated from something, .gitignore the binaries and keep scripts that rebuild the necessary data (see the sketch at the end of this answer). The scripts could live in the same repo or a separate one, depending on where they fit best.
The same logic applies if they're obtained from some external source, e.g. a download. Usually we don't store dependencies in the repository, only references to them: we don't commit virtualenvs, only a requirements.txt file. The rough Java-world analogy is not committing .jar files, only a pom.xml or the dependencies section of build.gradle.
If they can be considered source material, e.g. you manipulate them with Python the way you would edit other source files, don't worry about their binary nature and just keep them in your repository.
If they aren't really source material, but their generation process is complicated or takes a very long time, and the files aren't meant to be updated regularly, I don't think it's terribly wrong to keep them in the repo. Leaving a note (a README or similar) about how the files were produced is a good idea, of course.
Oh, and if the files are large (like, hundreds of megabytes or more), consider taking a look at git-lfs.
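For the first case, here is a minimal sketch of such a rebuild script, assuming the pickles are plain sklearn models; the file names, dataset, and model choice are made up for illustration. You would commit train_model.py, add models/ to .gitignore, and regenerate the pickle on demand.

import pickle
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def build_model(output="models/model.pkl"):
    # Hypothetical training step; replace with your real data and pipeline.
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    Path(output).parent.mkdir(parents=True, exist_ok=True)
    with open(output, "wb") as f:
        pickle.dump(model, f)

if __name__ == "__main__":
    build_model()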

Related

Where to store large resource files with python/pypi packages?

I have the following problem: I'm developing a package in python that will need a static data file for its operation that is somewhat large (currently around 70 MB, may get larger over time).
This isn't excessively large, but it's likely beyond what pypi will accept, so just having the file as a resource file as part of the package is not really an option. It also doesn't compress very well.
So I'm expecting to do something like the following: store the file somewhere it can be downloaded via https and add a command to the tool that downloads the extra data. (I.e. something like a command-line tool with a --fetch-operational-data parameter that one might call once after installation and again for updates every now and then, though updates of that file are expected to be rare.)
However, this leads to the question of where to store that data, and I feel there's no really good option.
"Usually", package resource files can be managed with importlib_resources, which can access files stored within module directories. But functions like open_binary are read-only, and while one could probably get the path and write there, that goes against the intended use (e.g. a major selling point of the importlib machinery is that it works for zipped packages, which writing would obviously break).
Alternatively, one could use a dot directory (~/.mytool/). However, this means there's no good way to install the data globally.
On the other hand, there could be a system-wide directory (/var/lib/mytool?), but then a regular user couldn't set the data up themselves. One could try to autodetect whether the data is in /var/lib, fall back to ~/.mytool, and write to whichever is writable when the update command runs.
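For what it's worth, a minimal sketch of that autodetection idea using only the standard library (mytool and both paths are placeholders):

import os
from pathlib import Path

SYSTEM_DIR = Path("/var/lib/mytool")      # hypothetical system-wide location
USER_DIR = Path.home() / ".mytool"        # hypothetical per-user fallback

def data_dir(for_writing=False):
    # Prefer the system-wide directory when it exists (and is writable, if we
    # need to update the data); otherwise fall back to the user's dot directory.
    if SYSTEM_DIR.is_dir() and (not for_writing or os.access(SYSTEM_DIR, os.W_OK)):
        return SYSTEM_DIR
    USER_DIR.mkdir(parents=True, exist_ok=True)
    return USER_DIR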
Furthermore, the tool is currently runnable straight from its Git repo, which adds another complication (should it download the file into an extra directory inside the repo when run from there, or also use /var/lib/mytool / ~/.mytool?)
Whatever I would do, it feels like an ugly hack. Any good ideas?

Binary data: How should I make it available if I do not commit it to Git?

I am using Python and Git for scientific applications. I often develop software tools that consist of a relatively small codebase and some large binary datasets.
Ideally, I want the software to work "out of the box", without the user having to manually download additional binary files. If I commit the binary data to Git, users can install everything using a single pip command (pip install git+https://github.com/myuser/myrepo). However, committing binary files is considered bad practice for a number of good reasons, such as slow download times if users want to upgrade the codebase. Also, a small backwards-compatible change to the binary files may lead to a large increase in the repo size.
Alternatively, I can put the binary data on an external server and let the software download the data if not present on the system (or if a newer version is available). Is there a preferred way to implement this procedure? In particular, is there a standard "download location" where the datasets should be placed?
For big files, you can enable GitHub's LFS service.
Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.
If your data will not change very often, you can also just attach it to a GitHub release.
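For illustration, a minimal sketch of the download-if-missing approach, assuming a hypothetical release-asset URL and cache location:

import urllib.request
from pathlib import Path

DATA_URL = "https://github.com/myuser/myrepo/releases/download/v1.0/data.bin"  # hypothetical
CACHE = Path.home() / ".myrepo" / "data.bin"                                   # hypothetical

def ensure_data():
    # Download the dataset on first use; afterwards the cached copy is reused.
    if not CACHE.exists():
        CACHE.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL, CACHE)
    return CACHE

The tool would call ensure_data() before touching the file; an explicit update command could simply delete the cached copy and call it again.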

Should I add Python's pyc files to .dockerignore?

I've seen several examples of .dockerignore files for Python projects where *.pyc files and/or __pycache__ folders are ignored:
**/__pycache__
*.pyc
Since these files/folders are going to be recreated in the container anyway, I wonder if it's a good practice to do so.
Yes, it's a recommended practice. There are several reasons:
Reduce the size of the resulting image
In .dockerignore you list files that are excluded from the build context and therefore never end up in the resulting image, which matters when you're trying to build the smallest possible image. Roughly speaking, bytecode files are about the same size as the corresponding source files. Bytecode files aren't intended for distribution, which is why we usually put them in .gitignore as well.
Cache-related problems
In versions before Python 3.2 there were cache-related issues:
Python's scheme for caching bytecode in .pyc files did not work well in environments with multiple Python interpreters. If one interpreter encountered a cached file created by another interpreter, it would recompile the source and overwrite the cached file, thus losing the benefits of caching.
Since Python 3.2, cached files are tagged with the interpreter version (e.g. mymodule.cpython-32.pyc) and placed in a __pycache__ directory. By the way, starting with Python 3.8 you can even control the directory where the cache is stored, which can be useful when write access to the source directory is restricted but you still want the benefits of bytecode caching.
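As a small illustration (not Docker-specific), the Python 3.8+ relocation feature is controlled with -X pycache_prefix or the PYTHONPYCACHEPREFIX environment variable and is visible from Python as sys.pycache_prefix:

import sys

# Run as: python -X pycache_prefix=/tmp/pycache myscript.py
# (or set the PYTHONPYCACHEPREFIX environment variable).
# With a prefix configured, .pyc files are written under that directory tree
# instead of __pycache__ next to the sources.
print(sys.pycache_prefix)  # None unless a prefix was configured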
Usually the cache system works perfectly, but some day something may go wrong. It's worth noting that a cached .pyc file living next to its source (the legacy layout) will be imported even if the .py file is missing. In practice this isn't common, but if removed code somehow keeps being "there", deleting the cache files is a good thing to try. It can matter when you're experimenting with the cache system or executing scripts in different environments.
Security reasons
Most likely you don't even need to think about it, but cache files can contain sensitive information. With the current implementation, .pyc files record the path to the source files, and there are situations where you don't want to share that information.
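A quick way to see what gets embedded (mymodule.py is a hypothetical file in the current directory; the 16-byte header applies to Python 3.7+):

import marshal
import py_compile

pyc_path = py_compile.compile("mymodule.py")   # hypothetical module to compile
with open(pyc_path, "rb") as f:
    f.read(16)                 # skip the .pyc header (magic, flags, mtime, size)
    code = marshal.load(f)
print(code.co_filename)        # the source path recorded at compile time
                               # (typically absolute when modules are imported normally)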
Interacting with bytecode files is apparently a fairly common need; django-extensions, for example, ships compile_pyc and clean_pyc management commands.

How to organize and categorize small projects in GitHub?

I'm new to development and trying to upload small projects that I've worked on to my GitHub profile. These projects are not dependent on each other.
My issue is that some of them are small single-file projects. Sort of like mini challenges that I've solved. So I'm thinking of grouping them together under one repo called "Python programming", for example.
Is this a good practice?
If yes, how should I go about it in Git, and
how can I still have a README file showing up for each mini project.
If no, what would you recommend doing?
GitHub renders a README file for every folder you visit, so with a single repository one solution is to create a subfolder for each "subproject", each with its own README file.
But before going that route, you should think about if those small projects actually belong together. That’s ultimately what should decide whether you want to put them all in the same repository or whether you want to split it up into several repositories.
Some things to consider for that decision:
If the projects do not depend on one another, do they still relate to one another? For example, are they part of a bigger programming challenge like Project Euler, and you're just collecting all your solutions? Then a single repository might make more sense.
What is the chance for individual projects to grow into bigger things? Many things start very small but can eventually grow into real things that justify their own repository. At that point, you might even get others to contribute.
Does it make sense for those individual files to share a history? Are the files even going to be edited once they are “done”? I.e. is this just a collection of finished things, or are they actually ongoing experiments?
Ultimately, it comes down to your personal choice. But GitHub, as the repository hoster, should not be driving your decision. You should create Git repositories locally as it makes sense to you. If that means you just have a single one, that’s fine. If that means you create lots of them, that’s also fine.
Unfortunately, the GitHub UI is not really made for small one-off projects. The repository list is just too unorganized for that. If you decide to create many small repositories, I advise adding some prefix for categorization within your GitHub profile, so you can tell what each one is about.
A good alternative for one-off projects, especially when it's just a single file (or a few), are Gists. Gists started as a way to share code snippets, but under the hood every Gist is actually a full Git repository. Of course, Gists do not offer the tools normal repositories on GitHub have (e.g. issues, pull requests, wikis), but for what you describe you probably don't need any of those. Gists are a fine way to share simple things without adding full repositories to your profile. And you can still clone them (the SSH remote URL is git@gist.github.com:<gist-id>.git) and get a full history and support for multiple files if you need them.
Commonly, you'll see the top level of the repo contain the README file, maybe a setup.py and some other supporting files, and perhaps a tests folder. Then there's a folder that shares a name with the repo; inside that folder is the code that's intended to be the core content of the module/package/script.
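For example, a layout along these lines (myproject and the file names are just placeholders):

myproject
    README.md
    setup.py
    tests
    myproject
        __init__.py
        core.py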
It's also not unusual to see different organization, particularly with very small projects of single-file scripts.
For the specific case you mention, do whatever you like. What you propose sounds totally reasonable to me. I would not want to have a separate repo for all the challenges I solve!
I usually use a gist for trivial items I don't necessarily want to make a repo for, including coding challenges. So I would offer that as an alternative. Do whatever suits you best.

Organizing Python projects with shared packages

What is the best way to organize and develop a project composed of many small scripts sharing one (or more) larger Python libraries?
We have a bunch of programs in our repository that all use the same libraries stored in the same repository. So in other words, a layout like
trunk
    libs
        python
        utilities
    projects
        projA
        projB
When the official runs of our programs are done, we want to record what version of the code was used. For our C++ executables, things are simple because as long as the working copy is clean at compile time, everything is fine. (And since we get the version number programmatically, it must be a working copy, not an export.) For Python scripts, things are more complicated.
The problem is that often one project (e.g. projA) will be running while projB needs to be updated. This can cause the working-copy revision to appear mixed to projA at runtime. (The code takes hours to run, and can be used as input for processes that take days to run, hence the strong traceability goal.)
My current workaround is, if necessary, to check out another copy of the trunk to a different location and run from there. But then I need to remember to change my PYTHONPATH to point to the second copy of libs/python, not the one in the first tree.
There's not likely to be a perfect answer. But there must be a better way.
Should we be using subversion keywords to store the revision number, which would allow the data user to export files? Should we be using virtualenv? Should we be going more towards a packaging and installation mechanism? Setuptools is the standard, but I've read mixed things about it, and it seems designed for non-developer end users (of which we have none).
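As an aside, the revision recording itself can be done from Python by shelling out to svnversion, which reports mixed (e.g. 4123:4168) or locally modified (e.g. 4168M) working copies; this is only a sketch and assumes the Subversion command-line client is on PATH.

import subprocess

def working_copy_revision(path="."):
    # svnversion prints e.g. "4168", "4123:4168" (mixed) or "4168M" (modified);
    # anything other than a single clean number means the run can't be traced
    # to exactly one revision.
    result = subprocess.run(["svnversion", path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

print("Code revision for this run:", working_copy_revision())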
The much better solution involves not storing all your projects and their shared dependencies in the same repository.
Use one repository for each project, and externals for the shared libraries.
Make use of tags in the shared library repositories, so consumer projects may use exactly the version they need in their external.
Edit: (just copying this from my comment) use virtualenv if you need to provide isolated runtime environments for the different apps on the same server. Then each environment can contain a unique version of the library it needs.
If I'm understanding your question properly, then you definitely want virtualenv. Add in some virtualenvwrapper goodness to make it that much better.
