For efficiency's sake I am trying to figure out how Python works with its heap of objects (and its system of namespaces, but that part is more or less clear). So, basically, I am trying to understand when objects are loaded into the heap, how many of them there are, how long they live, etc.
And my question is when I work with a package and import something from it:
from pypackage import pymodule
what objects get loaded into the memory (into the object heap of the python interpreter)? And more generally: what happens? :)
I guess the above example does something like:
Some object for the package pypackage was created in memory (it contains some information about the package, but not too much), the module pymodule was loaded into memory, and a reference to it was created in the local namespace. The important thing here is: no other modules of pypackage (or other objects) were created in memory unless that is stated explicitly (in the module itself, or somewhere in the package initialization tricks and hooks, which I am not familiar with). In the end the only big thing in memory is pymodule (i.e. all the objects that were created when the module was imported). Is that so? I would appreciate it if someone clarified this matter a little. Maybe you could advise some useful article about it? (The documentation covers more particular things.)
I have found the following to the same question about the modules import:
When Python imports a module, it first checks the module registry (sys.modules) to see if the module is already imported. If that’s the case, Python uses the existing module object as is.
Otherwise, Python does something like this:
Create a new, empty module object (this is essentially a dictionary)
Insert that module object in the sys.modules dictionary
Load the module code object (if necessary, compile the module first)
Execute the module code object in the new module’s namespace. All variables assigned by the code will be available via the module object.
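The steps quoted above can be sketched in a few lines of Python. This is only a rough approximation of what the interpreter does: the helper import_from_source and the module name demo_mod are made up for illustration, and the source comes from a string instead of a file.

```python
import sys
import types


def import_from_source(name, source):
    """Hypothetical helper that mimics the quoted import steps."""
    # Step 0: reuse the module if it is already registered.
    if name in sys.modules:
        return sys.modules[name]
    # Step 1: create a new, empty module object.
    module = types.ModuleType(name)
    # Step 2: register it before running any of its code.
    sys.modules[name] = module
    # Step 3: compile the source into a code object.
    code = compile(source, f"<{name}>", "exec")
    # Step 4: execute the code in the module's namespace.
    exec(code, module.__dict__)
    return module


demo = import_from_source("demo_mod", "x = 1\ndef f():\n    return x + 1\n")
print(demo.f())                                            # 2
print(import_from_source("demo_mod", "x = 100") is demo)   # True: cached, not re-run
print(demo.x)                                              # 1
```

The second call returns the cached object without looking at the new source, which is the same reason a repeated import statement does not re-run a module.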
And would be grateful for the same kind of explanation about packages.
By the way, with packages a module name is added into the sys.modules oddly:
>>> import sys
>>> from pypacket import pymodule
>>> "pymodule" in sys.modules.keys()
False
>>> "pypacket" in sys.modules.keys()
True
And also there is a practical question concerning the same matter.
Suppose I build a set of tools that might be used in different processes and programs, and I put them in modules. I have no choice but to load a full module even when all I want is to use only one function declared there. As I see it, one can make this problem less painful by making small modules and putting them into a package (if a package doesn't load all of its modules when you import only one of them).
Is there a better way to make such libraries in Python? (With the mere functions, which don't have any dependencies within their module.) Is it possible with C-extensions?
PS sorry for such a long question.
You have a few different questions here. . .
About importing packages
When you import a package, the sequence of steps is the same as when you import a module. The only difference is that the package's code (i.e., the code that creates the "module code object") is the code of the package's __init__.py.
So yes, the sub-modules of the package are not loaded unless the __init__.py does so explicitly. If you do from package import module, only module is loaded, unless of course it imports other modules from the package.
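One way to check this for yourself is to build a throwaway package in a temporary directory and look at sys.modules after importing one submodule. The names demo_pkg_lazy, first and second are invented for this sketch.

```python
import importlib
import pathlib
import sys
import tempfile

# Build a small package on disk: an empty __init__.py and two submodules.
root = pathlib.Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg_lazy"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "first.py").write_text("VALUE = 1\n")
(pkg_dir / "second.py").write_text("VALUE = 2\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_pkg_lazy import first

print("demo_pkg_lazy" in sys.modules)         # True: the package itself was loaded
print("demo_pkg_lazy.first" in sys.modules)   # True: the requested submodule
print("demo_pkg_lazy.second" in sys.modules)  # False: never asked for, never loaded
```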
sys.modules names of modules loaded from packages
When you import a module from a package, the name that is added to sys.modules is the "qualified name" that specifies the module name together with the dot-separated names of any packages you imported it from. So if you do from package.subpackage import mod, what is added to sys.modules is "package.subpackage.mod".
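The same thing can be seen with a standard-library package. A small sketch, using xml.etree.ElementTree only because it ships with Python:

```python
import sys
from xml.etree import ElementTree

print("ElementTree" in sys.modules)                          # False: no bare name
print("xml.etree.ElementTree" in sys.modules)                # True: qualified name
print("xml" in sys.modules and "xml.etree" in sys.modules)   # True: parents are loaded too
print(sys.modules["xml.etree.ElementTree"] is ElementTree)   # True: same object
```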
Importing only part of a module
It is usually not a big concern to have to import the whole module instead of just one function. You say it is "painful" but in practice it almost never is.
If, as you say, the functions have no external dependencies, then they are just pure Python and loading them will not take much time. Usually, if importing a module takes a long time, it's because it loads other modules, which means it does have external dependencies and you have to load the whole thing.
If your module has expensive operations that happen on module import (i.e., they are global module-level code and not inside a function), but aren't essential for use of all functions in the module, then you could, if you like, redesign your module to defer that loading until later. That is, if your module does something like:
def simpleFunction():
    pass
# open files, read huge amounts of data, do slow stuff here
you can change it to
def simpleFunction():
    pass
def loadData():
    # open files, read huge amounts of data, do slow stuff here
    pass
and then tell people "call someModule.loadData() when you want to load the data". Or, as you suggested, you could put the expensive parts of the module into their own separate module within a package.
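A slightly more convenient variant of this idea is to cache the result, so callers can call loadData() as often as they like and the slow work still happens only once. This is just a sketch: the use of functools.lru_cache and the list comprehension standing in for the slow work are my own assumptions, not part of the module described above.

```python
import functools


def simpleFunction():
    # Cheap function that does not need the expensive data.
    return "cheap"


@functools.lru_cache(maxsize=None)
def loadData():
    """Run the slow work on the first call only; later calls reuse the result."""
    return [n * n for n in range(100_000)]  # stand-in for slow file reads


print(simpleFunction())
first = loadData()
second = loadData()
print(first is second)               # True: the second call hit the cache
print(loadData.cache_info().misses)  # 1: the slow path ran once
```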
I've never found it to be the case that importing a module caused a meaningful performance impact unless the module was already large enough that it could reasonably be broken down into smaller modules. Making tons of tiny modules that each contain one function is unlikely to gain you anything except maintenance headaches from having to keep track of all those files. Do you actually have a specific situation where this makes a difference for you?
Also, regarding your last point, as far as I'm aware, the same all-or-nothing load strategy applies to C extension modules as for pure Python modules. Obviously, just like with Python modules, you could split things up into smaller extension modules, but you can't do from someExtensionModule import someFunction without also running the rest of the code that was packaged as part of that extension module.
The approximate sequence of steps that occurs when a module is imported is as follows:
Python tries to locate the module in sys.modules and does nothing else if it is found. Modules inside packages are keyed by their fully qualified name, so while pymodule is missing from sys.modules, pypacket.pymodule will be there (and can be obtained as sys.modules["pypacket.pymodule"]).
Python locates the file that implements the module. If the module is part of the package, as determined by the x.y syntax, it will look for directories named x that contain both an __init__.py and y.py (or further subpackages). The bottom-most file located will be either a .py file, a .pyc file, or a .so/.pyd file. If no file that fits the module is found, an ImportError will be raised.
An empty module object is created, and the code in the module is executed with the module's __dict__ as the execution namespace.1
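The module object is placed in sys.modules, and injected into the importer's namespace. (Strictly speaking, the default machinery adds the module to sys.modules just before executing its code in step 3, so that circular imports can find the partially initialized module.)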
Step 3 is the point at which "objects get loaded into memory": the objects in question are the module object, and the contents of the namespace contained in its __dict__. This dict typically contains top-level functions and classes created as a side effect of executing all the def, class, and other top-level statements normally contained in each module.
Note that the above only describes the default implementation of import. There are a number of ways one can customize import behavior, for example by overriding the __import__ built-in or by implementing import hooks.
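As a taste of what an import hook looks like, here is a minimal sketch of a meta path finder that only records which modules the import system asks about and then lets the normal finders do the real work. The class name LoggingFinder is made up; colorsys is used simply because it is a small standard-library module with no imports of its own.

```python
import importlib.abc
import sys


class LoggingFinder(importlib.abc.MetaPathFinder):
    """Hypothetical hook: note every lookup, never handle it."""

    def __init__(self):
        self.seen = []

    def find_spec(self, fullname, path, target=None):
        self.seen.append(fullname)
        return None  # None means "not mine", so the next finder is tried


finder = LoggingFinder()
sys.meta_path.insert(0, finder)
try:
    # Forget any cached copy so the import really goes through the finders.
    sys.modules.pop("colorsys", None)
    import colorsys
finally:
    sys.meta_path.remove(finder)

print(finder.seen)
```

A module that is already in sys.modules never reaches the finders at all, which is why the cached entry is removed first.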
1 If the module file is a .py source file, it will be compiled into memory first, and the code objects resulting from the compilation will be executed. If it is a .pyc, the code objects will be obtained by deserializing the file contents. If the module is a .so or a .pyd shared library, it will be loaded using the operating system's shared-library loading facility, and the init<module> C function (named PyInit_<module> in Python 3) will be invoked to initialize the module.
For example:
if I write
import math
I am not good at programming. For a program to run, it has to load the code into memory and convert it to 0s and 1s that a computer can understand?
So, will it load the entire math module when a reference to any function in that module is made in my program, or will it expand the module only once in the program? If it expands only once, I assume the computer will load the entire Python file and all the modules it imports completely into memory? Won't that cause memory to run out if I import too much native Python code from the library?
Is that the reason some people say it is always good to import the exact function in your program instead of wild cards?
Will it load the entire math module when a reference to any function
in that module is made in my program, or will it expand the module
only once in the program?
The math module will be loaded into memory once per Python program (or interpreter).
If it expands only once, I assume, the computer will load entire
python file and all the modules it imports completely in memory
Yes, in normal circumstances.
Won't that cause a memory running out of space issue if I import too
many native python code from the library?
No, not typically. Python modules would not put a dent in the memory of modern computers.
Is that the reason some people say it is always good to import exact
function in your program instead of wild cards?
No, the entire module will be loaded regardless of whether you use just one function in it. This is because that one function can rely on any other code in the module.
The reason it is advised to import specific functions is more of a best practice or recommendation to be explicit about what you are using from the module.
Also, the module may contain function names in it that are the same as ones you define yourself or are even in another imported module so you want to be careful not to import a bunch of names, especially if you are not going to use them.
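To make the name-clash point concrete, here is a small sketch using math and cmath, which both define sqrt. The wildcard imports are run with exec into a plain dict so the example works anywhere, not just at module level.

```python
import cmath
import math

ns = {}  # stands in for your module's namespace

exec("from math import *", ns)
print(ns["sqrt"] is math.sqrt)    # True

exec("from cmath import *", ns)
print(ns["sqrt"] is cmath.sqrt)   # True: math.sqrt was silently replaced
print(ns["sqrt"](-1))             # 1j instead of a ValueError
```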
Importing a Python module means loading the namespace of what is available in that module into memory. Specifically, writing "import <name>" tells Python to look for a module or package with that name on your Python path (sys.path, typically a list of folders). If it finds a package (a folder), it runs that package's __init__.py file; if it finds a plain module, it runs the module file itself. The __init__.py file is often blank, in which case importing the package loads nothing else until you import its submodules, but this file can also be customized.
Python automatically tracks what is loaded and doesn't re-load already loaded items. Hence, writing
import time
import itertools
import itertools as it
doesn't reload time if itertools also uses the time module, and doesn't reload itertools when you bind it a second time under the name it. In general, importing is quick: if compiled bytecode is already cached (a .pyc file), Python skips the compile step, and if the module has already been imported, the statement amounts to little more than a dictionary lookup in sys.modules. Importing is rarely the source of speed issues. The cost comes from whatever top-level code the module runs the first time it is imported, not from getting its bytes into memory.
Something like:
from itertools import combinations
does not load itertools any faster or slower than if you simply loaded the whole module (there are exceptions to this where some modules are so big that they are broken into sub-packages and so loading at the highest level doesn't load anything at all, for example, scipy, you have to specify the sub-package you want to load).
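A quick way to see that the from form still loads the whole module, and only changes which names get bound, is the sketch below. The import runs inside a plain dict (via exec) so the resulting namespace can be inspected.

```python
import sys

ns = {}  # stands in for the importing module's namespace
exec("from itertools import combinations", ns)

print("combinations" in ns)   # True: the one name that was requested
print("itertools" in ns)      # False: the module name itself is not bound
print(hasattr(sys.modules["itertools"], "permutations"))  # True: the rest was loaded anyway
```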
The reason it is recommended not to run from itertools import * is that this fills your namespace with dozens of functions which no one can track. Suddenly, you use the function combinations and no one who reads your code has any idea which module it came from. Hence, it is considered bad practice to * import. Ironically, some programming languages like C do nothing but * importing, and there it really is impossible to track where variables have come from.
Given a package:
package/
├── __init__.py
└── module.py
__init__.py:
from .module import function
module.py:
def function():
    pass
One can import the package and print its namespace.
python -c 'import package; print(dir(package))'
['__builtins__', ..., 'function', 'module']
Question:
Why does the namespace of package contain module when only function was imported in the __init__.py?
I would have expected that the package's namespace would only contain function and not module. This mechanism is also mentioned in the Documentation,
"When a submodule is loaded using any mechanism (e.g. importlib APIs,
the import or import-from statements, or built-in __import__()) a
binding is placed in the parent module’s namespace to the submodule
object."
but is not really motivated. For me this choice seems odd, as I think of sub-modules as implementation detail to structure packages and do not expect them to be part of the API as the structure can change.
Also I know "Python is for consenting adults" and one cannot truly hide anything from a user. But I would argue that binding the sub-modules' names in the package's scope makes it less obvious to a user what is actually part of the API and what can change.
Why not use a __sub_modules__ attribute or so to make sub-modules accessible to a user? What is the reason for this design decision?
You say you think of submodules as implementation details. This is not the design intent behind submodules; they can be, and extremely commonly are, part of the public interface of a package. The import system was designed to facilitate access to submodules, not to prevent access.
Loading a submodule places a binding into the parent's namespace because this is necessary for access to the module. For example, after the following code:
import package.submodule
the expression package.submodule must evaluate to the module object for the submodule. package evaluates to the module object for the package, so this module object must have a submodule attribute referring to the module object for the submodule.
At this point, you are almost certainly thinking, "hey, there's no reason from .submodule import function has to do the same thing!" It does the same thing because this attribute binding is part of submodule initialization, which only happens on the first import, and which needs to do the same setup regardless of what kind of import triggered it.
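The behaviour from the question can be reproduced with a throwaway package. In this sketch the package is written to a temporary directory under the made-up name demo_pkg_bind; its layout otherwise matches the one in the question.

```python
import importlib
import pathlib
import sys
import tempfile

# Recreate the package from the question in a temporary directory.
root = pathlib.Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg_bind"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("from .module import function\n")
(pkg_dir / "module.py").write_text("def function():\n    pass\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_pkg_bind

print(hasattr(demo_pkg_bind, "function"))  # True: imported explicitly in __init__.py
print(hasattr(demo_pkg_bind, "module"))    # True: bound when the submodule was initialized
print(demo_pkg_bind.module is sys.modules["demo_pkg_bind.module"])  # True
```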
This is not an extremely strong reason. With enough changes and rejiggering, the import system definitely could have been designed the way you expect. It was not designed that way because the designers had different priorities than you. Python's design cares very little about hiding things or supporting any notion of privacy.
You have to understand that Python is a runtime language. def, class and import are all executable statements that will, when executed, create (respectively) a function, class or module object and bind it in the current namespace.
wrt/ modules (packages being modules too - at least at runtime), the very first time a module is imported (directly or indirectly) for a given process, the matching .py (well, usually its compiled .pyc version) is executed (all statements at the top level are executed in order), and the resulting namespace will be used to populate the module instance. Only once this has been done can any name defined in the module be accessed (you cannot access something that doesn't exist yet, can you?). Then the module object is cached in sys.modules for subsequent imports. In this process, when a sub-module is loaded, it's considered an attribute of its parent module.
For me this choice seems odd, as I think of sub-modules as implementation detail to structure packages and do not expect them to be part of the API as the structure can change
Actually, Python's designers considered things the other way round: a "package" (note that there's no 'package' type at runtime) is mostly a convenience to organize a collection of related modules - IOW, the module is the real building block - and as a matter of fact, at runtime, when what you import is technically a "package", it still materializes as a module object.
Now wrt/ the "do not expect them to be part of the API as the structure can change", this has of course been taken into account. It's actually a quite common pattern to start out with a single module, and then turn it into a package as the code base grows - without impacting client code, of course. The key here is to make proper use of your package's initializer - the __init__.py file - which is actually what your package's module instance is built from. This lets the package act as a "facade", masking the "implementation details" of which submodule effectively defines which function, class or whatever.
So the solution here is simply to, in your package's __init__.py, 1/ import the names you want to make public (so the client code can import directly from your package instead of having to go through the submodule) and 2/ define the __all__ attribute with the names that should be considered public so the interface is clearly documented.
FWIW, this last operation should be done for all your submodules too, and you can also use the _single_leading_underscore naming convention for things that are really really "implementation details".
None of this will of course prevent anyone from importing even "private" names directly from your submodules, but then they are on their own when something breaks ("we are all consenting adults" etc).
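Here is a minimal sketch of that facade pattern. The package demo_pkg_facade, its private submodule _shapes and the function area are invented for the example, and the wildcard import is run through exec into a dict so the exported names can be listed.

```python
import importlib
import pathlib
import sys
import tempfile

root = pathlib.Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg_facade"
pkg_dir.mkdir()
# Implementation lives in a submodule marked private by its leading underscore.
(pkg_dir / "_shapes.py").write_text(
    "def area(width, height):\n"
    "    return width * height\n"
    "def _scale(value):\n"
    "    return value * 2\n"
)
# The initializer re-exports the public name and documents it in __all__.
(pkg_dir / "__init__.py").write_text(
    "from ._shapes import area\n"
    "__all__ = ['area']\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

ns = {}
exec("from demo_pkg_facade import *", ns)
exported = sorted(name for name in ns if not name.startswith("__"))
print(exported)          # ['area']
print(ns["area"](3, 4))  # 12

import demo_pkg_facade
print(hasattr(demo_pkg_facade, "_shapes"))  # True: still reachable, just not advertised
```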
Sorry for the confusing title, let me explain what I mean. I came across a piece of code similar to the following using Google's PrettyTensor API, where it allows for custom functions to be added to the PrettyTensor class through its @prettytensor.Register() decorator.
(located in custom_ops.py)
import prettytensor as pt
@pt.Register(...)
def custom_foo(bar):
    ...
(located in main.py)
import prettytensor as pt
import custom_ops
x = pt.custom_foo(bar)
This code accesses prettytensor through 2 separate files, and I don't understand why the changes made in one file carry over to the other. What's also interesting is that the order of the imports doesn't matter.
import custom_ops
import prettytensor as pt
x = pt.custom_foo(bar)
The code above still works fine. I would like help finding an explanation for this phenomenon, as I could not find documentation for it anywhere. It seems to me like the python interpreter is caching the module in memory, and when it is altered by the custom_ops file it persists in the interpreter when it is imported again. If anyone knows why this happens, how would you stop it from occurring?
The reason both your modules see the same version of the prettytensor module is that Python caches the module objects it creates when it loads a module for the first time. The same module object can then be imported any number of times in different places (or even several times within the same module, if you had a reason to do that), without being reloaded from its file.
You can see all the modules that have been loaded in the dictionary sys.modules. Whenever you do an import of a module that's already been loaded, Python will see it in sys.modules and you'll get a reference to the module object that already exists instead of a new module loaded from the .py file.
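The effect can be reproduced without PrettyTensor. In the sketch below, demo_lib is a made-up stand-in for the library, register is a hypothetical decorator that attaches functions to it, and the two "files" are simulated by running source strings in separate namespaces.

```python
import sys
import types

# Hypothetical stand-in for a library such as prettytensor.
lib = types.ModuleType("demo_lib")


def register(func):
    """Attach the decorated function to the shared module object."""
    setattr(lib, func.__name__, func)
    return func


lib.register = register
sys.modules["demo_lib"] = lib

custom_ops_src = """
import demo_lib

@demo_lib.register
def custom_foo(bar):
    return bar * 2
"""

main_src = """
import demo_lib
result = demo_lib.custom_foo(21)
"""

ops_ns, main_ns = {}, {}
exec(custom_ops_src, ops_ns)   # plays the role of custom_ops.py
exec(main_src, main_ns)        # plays the role of main.py

print(main_ns["result"])                             # 42
print(ops_ns["demo_lib"] is main_ns["demo_lib"])     # True: one shared module object
```

Both namespaces end up holding a reference to the same object from sys.modules, so a change made through one reference is visible through the other.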
In general, this is what you want. It's usually a very bad thing if two different parts of the code can get a reference to a module loaded from the same file via two different module names. For instance, you can have two objects that both claim to be instances of class foo.Foo, but they could be instances of two different foo.Foo classes if foo can be accessed two different ways. This can make debugging a real nightmare.
Duplicated modules can happen if your Python module search path is messed up (so that the modules inside a package are also exposed at the top level). It can also happen with the __main__ module (created from the file you're running as a script), which can also be imported using its normal name (e.g. main in your example with main.py).
You can also manually reload a module using the reload function. In Python 2 this was a builtin, but it's stashed away in importlib now in Python 3.
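For Python 3, a reload looks roughly like the sketch below. The module name demo_reload_mod and its contents are invented, and the file lives in a temporary directory.

```python
import importlib
import pathlib
import sys
import tempfile

root = pathlib.Path(tempfile.mkdtemp())
source_file = root / "demo_reload_mod.py"
source_file.write_text("VALUE = 1\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_reload_mod
print(demo_reload_mod.VALUE)  # 1

# Change the file on disk. The new source has a different length, so a
# cached .pyc from the first import cannot be mistaken for current.
source_file.write_text("VALUE = 22\n")

# A plain "import demo_reload_mod" here would reuse the cached module;
# reload re-executes the source in the existing module object.
reloaded = importlib.reload(demo_reload_mod)
print(reloaded is demo_reload_mod)  # True: same object, updated in place
print(demo_reload_mod.VALUE)        # 22
```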
I'm maintaining a dictionary that is loaded inside the config file. The dictionary is loaded from a JSON file.
In config.py
import json
name_dict = json.load(open(dict_file))
I'm importing this config file in several other scripts (file1.py, file2.py, ..., filen.py) using
import config
statement. My question is: when will the config.py script be executed? I'm sure it won't be executed for every import statement in my scripts. But what exactly happens when an import statement is executed?
The top-level code in a module is executed once, the first time you import it. After that, the module object will be found in sys.modules, and the code will not be re-executed to re-generate it.
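You can check this with a small experiment. The sketch below writes a made-up demo_config.py (plus a names.json for it to read) into a temporary directory, imports it twice, and counts how often its top-level print ran.

```python
import contextlib
import importlib
import io
import json
import pathlib
import sys
import tempfile

root = pathlib.Path(tempfile.mkdtemp())
(root / "names.json").write_text(json.dumps({"a": 1}))
(root / "demo_config.py").write_text(
    "import json, pathlib\n"
    "print('executing demo_config')\n"
    "name_dict = json.loads(\n"
    "    pathlib.Path(__file__).with_name('names.json').read_text())\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    first = importlib.import_module("demo_config")   # as if from file1.py
    second = importlib.import_module("demo_config")  # as if from file2.py

print(buffer.getvalue().count("executing demo_config"))  # 1
print(first is second)                                   # True
print(first.name_dict)                                   # {'a': 1}
```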
There are a few exceptions to this:
reload, obviously.
Accidentally importing the same module under two different names (e.g., if the module is in a package, and you've got some directory in the middle of the package in sys.path, you could end up with mypackage.mymodule and mymodule being two copies of the same thing, in which case the code gets run twice).
Installing import hooks/custom importers that replace the standard behavior.
Explicitly monkeying with sys.modules.
Directly calling functions out of imp/importlib or the like.
Certain cases with multiprocessing (and modules that use it indirectly, like concurrent.futures).
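The "two different names" case is the easiest of these to hit by accident, so here is a sketch of it. The package demo_dup_pkg and module demo_dup_mod are invented; the trick is that both the package's parent directory and the package directory itself are put on sys.path.

```python
import importlib
import pathlib
import sys
import tempfile

root = pathlib.Path(tempfile.mkdtemp())
pkg_dir = root / "demo_dup_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "demo_dup_mod.py").write_text("class Foo:\n    pass\n")

# Misconfigured path: the package directory is exposed as a top-level location too.
sys.path[:0] = [str(root), str(pkg_dir)]
importlib.invalidate_caches()

import demo_dup_pkg.demo_dup_mod as qualified
import demo_dup_mod as top_level

print(qualified is top_level)                       # False: the file was executed twice
print(qualified.Foo is top_level.Foo)               # False: two distinct classes
print(isinstance(top_level.Foo(), qualified.Foo))   # False
```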
For Python 3.1 and later, this is all described in detail under The import system. In particular, look at the Searching section. (The multiprocessing-specific cases are described for that module.)
For earlier versions of Python, you pretty much have to infer the behavior from a variety of different sources and either reading the code or experimenting. However, the well-documented new behavior is intended to work like the old behavior except in specifically described ways, so you can usually get away with reading the 3.x docs even for 2.x.
Note that in general, you don't want to rely on whether top-level code in the module is run once or multiple times. For example, given a top-level function definition, as long as you never compare function objects, or rebind any globals that it (meaning the definition itself, not just the body) depends on, it doesn't make any difference. However, there are some exceptions to that, and loading start-time config files is a perfect example of an exception.
I've noticed sometimes if you call dir() on a package/module, you'll see other modules in the namespace that were imported as part of the implementation and aren't meant for you to use. For instance, if I install the fish package from PyPI and import it, I see fish.sys, which just refers to the built-in sys module.
My question is whether that's sane and what to do about it if it's not.
I don't think the __all__ variable is too relevant, since that only affects the behavior of from X import *. The options I see are:
structure your packages better, and at least push the namespace clutter down into submodules
use import X as _X in your package to distinguish implementation details from your package API
import things from inside your functions (blegh)
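To illustrate the second option, here is a rough sketch of what the underscore alias changes. The module demo_tools is built in memory and is purely hypothetical. It imports sys plainly and os under a private alias.

```python
import sys
import types

# Build a throwaway module whose source uses both import styles.
mod = types.ModuleType("demo_tools")
exec(
    "import sys\n"
    "import os as _os\n"
    "def helper():\n"
    "    return _os.sep\n",
    mod.__dict__,
)
sys.modules["demo_tools"] = mod

ns = {}
exec("from demo_tools import *", ns)
public = sorted(name for name in ns if not name.startswith("__"))
print(public)                 # ['helper', 'sys']
print("_os" in dir(mod))      # True: still there, but visibly marked private
```

The plainly imported sys leaks through the wildcard import, while _os does not, although both remain visible in dir().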
My question is whether that's sane
It's sane. Doing import fish adds just one name to your namespace, that is not "namespace clutter". It's pretty much the big idea behind modules, grouping many things under one name!
When you want to know what a module does, look at the documentation or call help, don't do dir.
Names in a module's namespace are stored in a dictionary. This means that no matter how many names you see, looking up one of them takes constant time on average. So there is no speed drawback of any kind either.