I've seen several examples of .dockerignore files for Python projects where *.pyc files and/or __pycache__ folders are ignored:
**/__pycache__
*.pyc
Since these files/folders are going to be recreated in the container anyway, I wonder if it's a good practice to do so.
Yes, it's a recommended practice. There are several reasons:
Reduce the size of the resulting image
In .dockerignore you specify files that won't be sent to the Docker build context, so they never end up in the resulting image; that can be crucial when you're trying to build the smallest possible image. Roughly speaking, the bytecode files are about the same size as the actual source files. Bytecode files aren't intended for distribution, which is why we usually put them into .gitignore as well.
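If you want to check that rough size claim on one of your own files, a quick sketch with the standard library (mymodule.py is a stand-in name):

# a quick way to compare a source file with its compiled bytecode
# (mymodule.py is a stand-in name for one of your own modules)
import os, py_compile
pyc_path = py_compile.compile("mymodule.py")   # writes the .pyc under __pycache__ and returns its path
print(os.path.getsize("mymodule.py"), os.path.getsize(pyc_path))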
Cache-related problems
In earlier versions of Python 3.x there were several cache-related issues:
Python’s scheme for caching bytecode in .pyc files did not work well in environments with multiple Python interpreters. If one interpreter encountered a cached file created by another interpreter, it would recompile the source and overwrite the cached file, thus losing the benefits of caching.
Since Python 3.2, cached files are tagged with the interpreter version, e.g. mymodule.cpython-32.pyc, and stored under the __pycache__ directory. By the way, starting with Python 3.8 you can even control the directory where the cache is stored. That may be useful when write access to the source directory is restricted but you still want the benefits of caching.
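For instance, on Python 3.8+ you could redirect the cache like this (the target path and module name below are just illustrations):

# a minimal sketch, Python 3.8+: redirect bytecode caching away from the source tree
import sys
sys.pycache_prefix = "/tmp/pycache"   # illustrative path; -X pycache_prefix or PYTHONPYCACHEPREFIX work too
import mymodule                       # stand-in module; its .pyc now lands under /tmp/pycache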
Usually the cache system works perfectly, but someday something may go wrong. It's worth noting that a cached .pyc file that lives in the same directory as the source will be used instead of the .py file if the .py file is missing. In practice that's not a common occurrence, but if something stale keeps being "there", removing the cache files is a good first step. This can matter when you're experimenting with the cache system in Python or executing scripts in different environments.
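A minimal cleanup sketch, assuming it's run from the project root:

# remove cached bytecode: __pycache__ directories plus any legacy .pyc living next to sources
import pathlib, shutil
for cache_dir in pathlib.Path(".").rglob("__pycache__"):
    shutil.rmtree(cache_dir, ignore_errors=True)
for stray_pyc in pathlib.Path(".").rglob("*.pyc"):
    stray_pyc.unlink()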
Security reasons
Most likely you don't even need to think about it, but cache files can contain some sort of sensitive information. With the current implementation, .pyc files embed the absolute path to the actual source files, and there are situations when you don't want to share such information.
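If you're curious, you can peek at that yourself; a rough sketch, assuming a module file named mymodule.py that has already been byte-compiled by the same interpreter (the 16-byte skip matches the CPython 3.7+ .pyc header layout):

# peek at the source path recorded inside a cached code object
import importlib.util, marshal
pyc_path = importlib.util.cache_from_source("mymodule.py")   # stand-in source file name
with open(pyc_path, "rb") as f:
    f.read(16)             # skip magic number, flags and source metadata (CPython 3.7+ header)
    code = marshal.load(f)
print(code.co_filename)    # frequently an absolute path to the original .py file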
It seems that interacting with bytecode files is a fairly frequent necessity; for example, django-extensions has the corresponding compile_pyc and clean_pyc commands.
I am coming from Java background and completely new at Python.
Now I have got a Python project that consists of a few Python scripts and pickle files stored in Git. The pickle files are serialized sklearn models.
I wonder how to organize this project. I think we should not store the pickle files in Git. We should probably store them as binary dependencies somewhere.
Does it make sense? What is a common way to store binary dependencies of Python projects?
Git is just fine with binary data. For example, many projects store e.g. images in git repos.
I guess the rule of thumb is to decide whether your binary files are source material, an external dependency, or an intermediate build step. Of course, there are no strict rules, so just decide how you feel about them. Here are my suggestions:
If they're (reproducibly) generated from something, .gitignore the binaries and have scripts that build the necessary data (a small sketch of this follows the list below). The scripts could live in the same repo or in a separate one - wherever it feels best.
The same logic applies if they're obtained from some external source, e.g. an external download. Usually we don't store dependencies in the repository - we only keep references to them. E.g. we don't commit virtualenvs but only a requirements.txt file - a rough Java-world analogy is not committing .jars but only pom.xml or a dependencies section in build.gradle.
If they can be considered source material, e.g. if you manipulate them with Python as an editor - don't worry about the files' binary nature and just keep them in your repository.
If they aren't really a source material, but their generation process is really complicated or takes very long, and the files aren't meant to be updated on a regular basis - I think it won't be terribly wrong to have them in the repo. Leaving a note (README.txt or something) about how the files were produced would be a good idea, of course.
Oh, and if the files are large (like, hundreds of megabytes or more), consider taking a look at git-lfs.
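To illustrate the first suggestion, here is a minimal sketch of a script that regenerates a .gitignored pickle; the toy dataset and the model.pkl name are illustrations of the pattern, not your actual project:

# train_model.py - a sketch of the "regenerate instead of commit" idea
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                   # stand-in for your real training data
model = LogisticRegression(max_iter=1000).fit(X, y)

with open("model.pkl", "wb") as f:                   # model.pkl itself sits in .gitignore
    pickle.dump(model, f)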
I am writing a memoizing decorator for Python 3. The plan is to pickle the cache in a temporary file to save time across multiple executions.
The cache will be deleted if the file containing the decorated function has been changed, just like the .pyc file in __pycache__. So then I started thinking about uncluttering the user's project directory by placing the pickled cache in __pycache__. Are there side effects of storing files in this directory? Is it misleading or confusing?
You should either...
Have all caching happen at runtime: This defeats your purpose. However, it is the only way not to touch the filesystem at all.
Dedicate a special folder like __memo__ or something: This will allow you to have caching across several executions. However, you'll "pollute" the application's file structure.
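For what it's worth, the second option can stay very small. Here is a minimal sketch of a disk-persisted memoizer that keeps its pickles in a dedicated __memo__ folder; the decorator name, hashing scheme and layout are illustrative, and the source-change invalidation from the question is left out:

# persist results as pickles in a __memo__ directory next to the source file
import functools, hashlib, os, pickle

def disk_memoize(func):
    cache_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "__memo__")
    os.makedirs(cache_dir, exist_ok=True)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # derive a stable file name from the function and its arguments
        raw_key = pickle.dumps((func.__qualname__, args, tuple(sorted(kwargs.items()))))
        path = os.path.join(cache_dir, hashlib.sha256(raw_key).hexdigest() + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result

    return wrapper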
Reasons you should never mess with __pycache__:
__pycache__ is CPython-specific. Although CPython is most likely the implementation you're using right now, assuming for no good reason that everyone else uses it too is neither portable, safe, nor good practice.
__pycache__ is intended as a transparent optimization, that is, no one should care if it exists at all or not. This is a design decision from the CPython folks, and so you should not circumvent it.
Due to the above, the directory may be disabled entirely if the user wants to. For instance, in CPython, if you run python3 -B myapp.py, no __pycache__ will be created if it doesn't already exist, and an existing one will be ignored (a runtime equivalent is sketched after this list).
Users often delete __pycache__ because of the two points above. I do, at least.
Messing with things inside __pycache__ because "it doesn't pollute the application's file structure" is an illusion: the directory can perfectly well be considered pollution itself. And if the Python interpreter already pollutes your tree, why can't you do the same with __memo__ anyway?
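On the -B point above: a program can also opt out of bytecode writing itself at runtime. A minimal sketch (sys.dont_write_bytecode is the standard mirror of that flag in the sys module):

# the runtime equivalent of the -B flag
import sys
sys.dont_write_bytecode = True   # imports made after this point won't write __pycache__ entries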
There are Python tools like check-manifest that verify that all the files under your VCS are also included in your MANIFEST.in, and release helpers like zest.releaser recommend using them.
I think files in the tests or docs directories are never used directly from the Python package. Usually services like Read the Docs or Travis CI access those files, and they get them from the VCS, not from the package. I have also seen packages including .travis.yml files, which makes even less sense to me.
What is the advantage of including all the files in the python package?
I don't remember these files being there when I first started working with my Django application, and they are not mentioned in the Django tutorial's list of files.
I looked into this more and discovered that these are what is known as bytecode – as discussed at this S.O. question
How did these files get created, and should I keep them in my folder or trash them?
Bytecode files are created by the interpreter the first time the code is run, so that (among other things) it will run faster on subsequent runs. The only reason to remove them is if you've changed the code in one of the corresponding .py files and, for some reason, the interpreter is not picking up the changes (possibly due to an issue with the timestamp). You can also get rid of them if you change versions of Python (for example, from 2 to 3, or from 2.6 to 2.7). Changing micro versions (such as 2.7.3 to 2.7.9) won't affect the bytecode structure, so a small upgrade like that is harmless. And, as sapi points out, if you want to delete a .py file for some reason, you should also delete the corresponding .pyc as well.
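If you'd rather force a clean recompile than delete anything by hand, the standard library's compileall can do it; a minimal sketch, assuming you run it from your project directory:

# recompile everything under the current directory instead of hunting down stale .pyc files
import compileall
compileall.compile_dir(".", force=True, quiet=1)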
As mgilson suggests, see this blog post from Ned Batchelder on the structure and function of .pyc files.
As I understand it, there are two types of modules in Python (CPython):
- the .so (C extension)
- the .py
The .so are only loaded once even when there are different processes/interpreters importing them.
The .py are loaded once for each process/interpreter (unless reloading explicitly).
Is there a way .py can be shared by multiple processes/interpreters?
One would still need some layer where one could store modifications done to the module.
I'm thinking one could embed the interpreter in a .so as a first step. Is there an already-developed solution?
I acknowledge I may be very far off in terms of feasible ideas about this. Please excuse my ignorance.
The reason .so (or .pyd) files take up memory space only once (except for their variables segment) is that they are recognized by the OS kernel as object code. .py files are only recognized as text files/data; it's the Python interpreter that grants them "code" status. Embedding the Python interpreter in a shared library won't resolve this.
Loading .py files only once despite their use in multiple processes would require changes deep inside CPython.
Your best option, if you want to save memory space, is to compile Python modules to .so files using Cython. That may require some changes to the modules.
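For reference, the usual entry point is a small setup script along these lines (mymodule.py is a stand-in name, and you'd need Cython and a C compiler installed):

# setup.py - a minimal sketch of compiling a pure-Python module to a .so with Cython
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("mymodule.py"))

You'd then build it with python setup.py build_ext --inplace and import the resulting extension as usual.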
No, there is no way. Python is so highly dynamic that I'm not sure sharing would make any sense anyway - each process could monkey-patch the modules, for example. Perhaps there would be a way to share the code anyway, but the benefit would be very small for something that is likely to be a lot of work.
The best answer I can give you is "not impossible, but I don't know if it happens".
You have to think about what is actually happening. When you encounter a .py file, Python has to read the file, compile it, and then execute byte code. Compilation takes place inside of the process, and so can't be shared.
When you encounter a .so file, the operating system links in memory that has been reserved for that library. All processes share the same memory region, and so you save memory.
Python already has a third way of loading modules: if it can, upon loading a .py file it creates a pre-compiled .pyc file that is faster to load (you avoid compilation), and the next time around it loads the .pyc file instead. Conceivably, it could share the .pyc file by just mmapping it into memory (using MAP_PRIVATE in case other things mess with that byte code later). If it did that, then shared modules would by default wind up in shared memory.
I have no idea whether it has actually been implemented in this way.