Why do packages have imports in the __init__.py file? - python

For example in multiprocessing package we can import class Process using from multiprocessing import Process. Why not from multiprocessing.context import Process where it really belongs?
In fact, I found that they are the same. Why?

Adding imports to __init__.py is usually done to shorten import paths and to define a public interface. from multiprocessing.context import Process is an internal interface and can change in the future without maintaining any backwards compatibility.
On the other hand, from multiprocessing import Process is the documented public interface and won't change in a way that breaks backwards compatibility.
You can see that __all__ in context.py is empty, meaning it has no public interface and you should not import from it in your application, as it can change in the future without any warning.
__all__ = [] # things are copied from here to __init__.py
Related section on this from PEP 8:
Public and internal interfaces
Any backwards compatibility guarantees apply only to public interfaces. Accordingly, it is important that users be able to clearly distinguish between public and internal interfaces.
Documented interfaces are considered public, unless the documentation explicitly declares them to be provisional or internal interfaces exempt from the usual backwards compatibility guarantees. All undocumented interfaces should be assumed to be internal.
To better support introspection, modules should explicitly declare the names in their public API using the __all__ attribute. Setting __all__ to an empty list indicates that the module has no public API.
Even with __all__ set appropriately, internal interfaces (packages, modules, classes, functions, attributes or other names) should still be prefixed with a single leading underscore.
An interface is also considered internal if any containing namespace (package, module or class) is considered internal.
Imported names should always be considered an implementation detail. Other modules must not rely on indirect access to such imported names unless they are an explicitly documented part of the containing module's API, such as os.path or a package's __init__ module that exposes functionality from submodules.
The well-known requests library has, in my opinion, a really nice public interface, and you can see that it is built by importing most things in the __init__.py file. You'll also find that the documented API corresponds to the names imported in __init__.py.
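As a rough sketch of that pattern (the package and module names below are invented for illustration, not taken from multiprocessing or requests), the internal module keeps an empty __all__ while the package's __init__.py re-exports and documents the public names:
# mypkg/_engine.py -- internal module, free to change between releases
__all__ = []  # nothing here is promised; users should import from the package instead

class Engine:
    def start(self):
        return "started"


# mypkg/__init__.py -- the documented public interface
from ._engine import Engine

__all__ = ["Engine"]  # what "from mypkg import *" exposes and what stays backwards compatible
Client code then writes from mypkg import Engine and never depends on which submodule actually defines the class.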

from multiprocessing import Process works because Process is imported into the __init__.py of the multiprocessing package. For example, in your shell, type the following code:
import multiprocessing
with open(multiprocessing.__file__, 'r') as f:
    print(f.read())
You'll see the lines:
from . import context
#
# Copy stuff from default context
#
globals().update((name, getattr(context._default_context, name))
                 for name in context._default_context.__all__)
__all__ = context._default_context.__all__
So yes, it's the same thing.
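You can confirm in an interpreter session that the copied name and the original are the very same object (checked against CPython 3.x; the exact mechanism in __init__.py may differ slightly between versions):
import multiprocessing
import multiprocessing.context

# Both names refer to the same class object, because __init__.py copied it
# out of the default context at import time.
print(multiprocessing.Process is multiprocessing.context.Process)  # True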

Related

Why does the from ... import ... statement contain an implicit import?

Given a package:
package/
├── __init__.py
└── module.py
__init__.py:
from .module import function
module.py:
def function():
    pass
One can import the package and print its namespace.
python -c 'import package; print(dir(package))'
['__builtins__', ..., 'function', 'module']
Question:
Why does the namespace of package contain module when only function was imported in the __init__.py?
I would have expected that the package's namespace would only contain function and not module. This mechanism is also mentioned in the Documentation,
"When a submodule is loaded using any mechanism (e.g. importlib APIs,
the import or import-from statements, or built-in __import__()) a
binding is placed in the parent module’s namespace to the submodule
object."
but is not really motivated. For me this choice seems odd, as I think of sub-modules as an implementation detail used to structure packages, and I do not expect them to be part of the API, since the structure can change.
Also, I know "Python is for consenting adults" and one cannot truly hide anything from a user. But I would argue that binding the sub-modules' names to the package's scope makes it less obvious to a user what is actually part of the API and what can change.
Why not use a __sub_modules__ attribute or something similar to make sub-modules accessible to a user? What is the reason for this design decision?
You say you think of submodules as implementation details. This is not the design intent behind submodules; they can be, and extremely commonly are, part of the public interface of a package. The import system was designed to facilitate access to submodules, not to prevent access.
Loading a submodule places a binding into the parent's namespace because this is necessary for access to the module. For example, after the following code:
import package.submodule
the expression package.submodule must evaluate to the module object for the submodule. package evaluates to the module object for the package, so this module object must have a submodule attribute referring to the module object for the submodule.
At this point, you are almost certainly thinking, "hey, there's no reason from .submodule import function has to do the same thing!" It does the same thing because this attribute binding is part of submodule initialization, which only happens on the first import, and which needs to do the same setup regardless of what kind of import triggered it.
This is not an extremely strong reason. With enough changes and rejiggering, the import system definitely could have been designed the way you expect. It was not designed that way because the designers had different priorities than you. Python's design cares very little about hiding things or supporting any notion of privacy.
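A quick way to see the binding happen, using only standard-library packages (the choice of logging.handlers and email.message is arbitrary):
import sys
import logging.handlers  # a dotted import: the submodule becomes an attribute of logging

print(logging.handlers is sys.modules["logging.handlers"])  # True: same module object
print(hasattr(sys.modules["logging"], "handlers"))          # True: the parent got the binding

# A from-import of a name defined *inside* a submodule triggers the same initialization:
from email.message import Message

print("email.message" in sys.modules)            # True
print(hasattr(sys.modules["email"], "message"))  # True: bound on the parent as well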
you have to understand that Python is a runtime language. def, class and import are all executable statements that will, when executed, create (respectively) a function, class or module object and bind them in the current namespace.
wrt/ modules (packages being modules too - at least at runtime), the very first time a module is imported (directly or indirectly) in a given process, the matching .py file (well, usually its compiled .pyc version) is executed (all statements at the top level are executed in order), and the resulting namespace is used to populate the module instance. Only once this has been done can any name defined in the module be accessed (you cannot access something that doesn't exist yet, can you?). Then the module object is cached in sys.modules for subsequent imports. In this process, when a sub-module is loaded, it is considered an attribute of its parent module.
For me this choice seems odd, as I think of sub-modules as implementation detail to structure packages and do not expect them to be part of the API as the structure can change
Actually, Python's designers considered things the other way round: a "package" (note that there's no 'package' type at runtime) is mostly a convenience to organize a collection of related modules - IOW, the module is the real building block - and as a matter of fact, at runtime, when what you import is technically a "package", it still materializes as a module object.
Now wrt/ the "do not expect them to be part of the API as the structure can change", this has of course been taken into account. It's actually a quite common pattern to start out with a single module, and then turn it into a package as the code base grows - without impacting client code, of course. The key here is to make proper use of your package's initializer - the __init__.py file - which is actually what your package's module instance is built from. This lets the package act as a "facade", masking the "implementation details" of which submodule effectively defines which function, class or whatever.
So the solution here is quite simply to, in your package's __init__.py, 1/ import the names you want to make public (so client code can import them directly from your package instead of having to go through the submodule) and 2/ define the __all__ attribute with the names that should be considered public, so the interface is clearly documented.
FWIW, this last operation should be done for all your submodules too, and you can also use the _single_leading_underscore naming convention for things that really are "implementation details".
None of this will of course prevent anyone from importing even "private" names directly from your submodules, but then they are on their own when something breaks ("we are all consenting adults" etc.).

Google App Engine: importing modules within the handler's scope vs global scope

My project consists of a lot of imports which are placed at 'global' scope.
from google.appengine.ext import ndb
from handlers import SomeHandler
import logging
import datetime # will only use this module ONCE
I want to use the datetime module just once inside a specific handler. Should I import datetime within the handler or should I leave it in global scope?
import datetime # Should we import here?
class BookHandler(webapp2.RequestHandler):
    import datetime # Or, should we import here?
    def get(self):
        today = datetime.datetime.now()
It seems like importing locally shows clearer dependency relationships. Is there any performance issue or other drawbacks to consider?
You should definitely import modules at the beginning of your file, which binds them to the scope of that file. I think what you are doing is called 'lazy loading' of modules, and it can cause bugs at runtime if a module is not installed or imported correctly.
By the way, the way Python works is that every time you import a module, the interpreter checks whether the module has already been imported. If it has, it simply sets a reference to it. In other words, it doesn't create a new instance of it.
What I recommend is to create a file for your handler class and import datetime and anything else you need at the beginning of that file.
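To see the caching described above in action, note that a second import simply hands back the object already stored in sys.modules:
import sys
import datetime
import datetime as dt_again  # no re-execution of the module: the cached object is reused

print(dt_again is datetime)                 # True
print(dt_again is sys.modules["datetime"])  # True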
There is no problem importing inside the handler (or even inside the get() function if you prefer) - I'm using it massively.
Advantages of lazy-loading:
reduced app loading time
the average memory footprint of your application can be much lower than the total memory footprint required should you load all modules at init time, even those rarely used - lower cost
Disadvantages of lazy-loading:
non-deterministic app loading time
potential hard-to-reproduce bugs hit at delayed module loading since the exact conditions are unknown
Related (in the lazy-loading sense only): App Engine: Few big scripts or many small ones?
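For concreteness, here is a minimal sketch of the in-handler ("lazy") import this answer is talking about; webapp2 and the handler name are taken from the question, the route is made up:
import webapp2

class BookHandler(webapp2.RequestHandler):
    def get(self):
        import datetime  # after the first request this is just a sys.modules lookup
        self.response.write(datetime.datetime.now().isoformat())

application = webapp2.WSGIApplication([('/book', BookHandler)])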
Hiding imports like this is an optimisation; whenever considering whether to optimise, verify that the proposed optimisation is really going to be effective.
Let's consider the specific example of datetime first. Here's a simple app:
import sys
import webapp2

class Handler(webapp2.RequestHandler):
    def get(self):
        if 'datetime' in sys.modules:
            self.response.write('The datetime module is already loaded.\n')
        else:
            self.response.write('The datetime module is not already loaded.\n')
        self.response.write('%d modules have been loaded\n' % len(sys.modules))
        count = sum(1 for x in sys.modules.values() if '/lib64' in repr(x))
        self.response.write('%d standard library modules have been loaded\n' % count)
        gcount = sum(1 for x in sys.modules.values() if 'appengine' in repr(x))
        self.response.write('%d appengine modules have been loaded\n' % gcount)

application = webapp2.WSGIApplication([('/', Handler)])
If we visit the '/' url we see this output:
The datetime module is already loaded.
706 modules have been loaded
95 standard library modules have been loaded
207 appengine modules have been loaded
Even in this minimal app, datetime has already been imported by the SDK*. Once Python has imported a module, further imports only cost a single dictionary lookup, so there is no benefit in hiding the import. Given that the SDK has already imported 95 standard library modules and 207 SDK modules, it follows that there is unlikely to be much benefit in hiding imports of commonly used standard library or SDK modules.
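A rough way to measure what a repeated import costs once the module is cached (absolute numbers will vary by machine; this only illustrates the order of magnitude):
import timeit

# datetime is in sys.modules after the first statement executes, so every
# subsequent "import datetime" is a cache lookup plus a local name binding.
print(timeit.timeit("import datetime", number=100000))
print(timeit.timeit("pass", number=100000))  # baseline loop overhead for comparison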
This leaves the question of imports of application code. Handlers can be lazy-loaded by declaring them as strings in routes, so that they are not imported until the route is visited:
app = webapp2.WSGIApplication([('/foo', 'handlers.FooHandler')])
This technique permits optimising startup time without hiding imports in classes or methods, should you find that it is necessary.
The cost of lazy loading, as the other answers point out, can be unexpected runtime errors. Moreover, if you choose to hide imports, it can also decrease code readability, potentially cause structural problems (for example, masking circular dependencies), and set a poor example for less experienced developers, who may assume that hiding imports is idiomatic rather than an optimisation.
So, when considering optimising in this way:
verify that optimisation is necessary: not all applications require absolute maximum performance
verify that the right problem is being addressed; RPC calls on App Engine tend to dominate response times
profile to verify that the optimisation is effective
consider the costs in code maintainability
*Relying on the SDK's sys.modules being similar to that of the cloud runtime is, I hope, a reasonable assumption.

Flavoring Packages

I use a package which has a structure somewhat similar to the following when communicating with a piece of hardware:
channel
    __init__.py
    transport
        __init__.py
        flow.py
        multiplex.py
    network
        __init__.py
        header.py
        addressing.py
I now wish to be able to configure my package so that I can use it to communicate with two very similar pieces of hardware. For example, when communicating with hw1 I want the equivalent of the following in addressing.py:
from collections import namedtuple
PacketSize = namedtuple('PacketSize', ('header', 'body'))
packet_size = PacketSize(16,256)
while when testing hw2, I want the equivalent of:
from collections import namedtuple
PacketSize = namedtuple('PacketSize', ('header', 'body'))
packet_size = PacketSize(8,256)
Almost all of the modules in the packages are the same for both hw1 and hw2. I might however even have slightly different flavours for certain functions and classes within the package.
I was thinking I could solve this by having this structure:
channel
    __init__.py
    transport
        __init__.py
        flow.py
        multiplex.py
    network
        __init__.py
        header.py
        addressing.py
        hw1
            __init__.py
            addressing.py
        hw2
            __init__.py
            addressing.py
So each subpackage will contain a hw1 and hw2 subpackage where hardware specific code is placed. I have programmed channel/network/addressing.py as follows:
from collections import namedtuple

PacketSize = namedtuple('PacketSize', ('header', 'body'))

if hardware == "hw1":
    from .hw1.addressing import *
elif hardware == "hw2":
    from .hw2.addressing import *
And channel/network/hw1/addressing.py like this:
from ..addressing import PacketSize
packet_size = PacketSize(16,256)
Does this make sense? I think channel/network/addressing.py is ugly, to be honest, since I'm doing an import, then defining a namedtuple, then continuing with the conditional imports. Could I do this better?
Is the general approach above the best way to flavor a package?
Is there a standard way to configure the package so that it knows whether it is concerned with hw1 or hw2? At the moment I just have a global called hardware, as seen above where I do if hardware == "hw1".
You should try to abstract away the hardware/flavor dependent features behind some kind of common interface. There are many different ways to do this, such as class inheritance, composition, passing around an object, or --- as you may be looking to do --- even as a global python module or object.
I personally tend to often favor composition, because class inheritance often isn't a natural fit, and may blow up into multiple inheritance or MixInMadness.
A global Python module (or a singleton object) is attractive, but I would steer away from it unless there really, really has to be only one of them in a single process. A good example of where this is a good design is when it is tied to the underlying platform, for instance the Python os module, which has much the same interface on Windows and Linux but works very differently underneath. Compare this to your hw1 and hw2. Another good example is the Twisted reactor, of which there really can only be one running at a time. Even then, a large part of the Twisted code passes around a reactor object, e.g. for compositing. This is partly to make unit testing possible.
For your example, if hw1 or hw2 refers to the hardware your program is running on, then a global Python module does make sense. If it instead refers to hardware your program is communicating with, e.g. over a serial port or the network, then a global module is the wrong approach. You might have two serial ports and want to speak to hw1 and hw2 in the same process.
For an example on how to use a global module, or actually a global object, I recommend looking at how Twisted does it. Then your modules would do something like
from mypackage import hardware # hw1/hw2 automatically detected
print hardware.packet_size # different for different hardware
or
# main.py
from mypackage import hw1
hw1.init()
# other.py
from mypackage import hardware # initialized to hw1 in main.py
Compositing or passing around an object on the other hand would look something like:
hw = mypackage.hw1.hw1factory()
send_data(hw, 'foo') # uses hw.packet_size
hw = mypackage.hw2.hw2factory()
frob = Frobnicator(hw)
frob.frobnicate('foo') # uses hw.packet_size internally
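A minimal, runnable sketch of the composition approach (the names HardwareProfile and send_data, and the factory values, are made up for illustration and are not part of the original package):
from collections import namedtuple

PacketSize = namedtuple('PacketSize', ('header', 'body'))
HardwareProfile = namedtuple('HardwareProfile', ('name', 'packet_size'))

HW1 = HardwareProfile('hw1', PacketSize(16, 256))
HW2 = HardwareProfile('hw2', PacketSize(8, 256))

def send_data(hw, payload):
    # Chunk the payload according to the profile that was passed in,
    # instead of reading a module-level global.
    body = hw.packet_size.body
    return [payload[i:i + body] for i in range(0, len(payload), body)]

print(len(send_data(HW1, b'\x00' * 1000)))  # 4 chunks of up to 256 bytes
print(len(send_data(HW2, b'\x00' * 1000)))  # also 4 here: only the header size differs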
Your approach is wrong. You should not have multiple modules to distinguish cases. In the case you describe, the module color.py might contain a function to which you pass a list of items to be tested and colors to test the items with. How you organize this depends on the data source and destination and the nature of the items you are testing.
You should consider using py.test and fixtures (which are NOT like Django fixtures). These do exactly what you want and py.test can handle UnitTest and Nose style tests as well.

What happens when you import a package?

For efficiency's sake I am trying to figure out how python works with its heap of objects (and system of namespaces, but it is more or less clear). So, basically, I am trying to understand when objects are loaded into the heap, how many of them are there, how long they live etc.
And my question is when I work with a package and import something from it:
from pypackage import pymodule
what objects get loaded into the memory (into the object heap of the python interpreter)? And more generally: what happens? :)
I guess the above example does something like:
some object for the package pypackage was created in memory (which contains some information about the package, but not too much), the module pymodule was loaded into memory, and a reference to it was created in the local namespace. The important thing here is: no other modules of pypackage (or other objects) were created in memory unless stated explicitly (in the module itself, or somewhere in the package initialization tricks and hooks, which I am not familiar with). In the end, the only big thing in memory is pymodule (i.e. all the objects that were created when the module was imported). Is that so? I would appreciate it if someone clarified this matter a little bit. Maybe you could recommend a useful article about it? (The documentation covers more particular things.)
I have found the following to the same question about the modules import:
When Python imports a module, it first checks the module registry (sys.modules) to see if the module is already imported. If that’s the case, Python uses the existing module object as is.
Otherwise, Python does something like this:
Create a new, empty module object (this is essentially a dictionary)
Insert that module object in the sys.modules dictionary
Load the module code object (if necessary, compile the module first)
Execute the module code object in the new module’s namespace. All variables assigned by the code will be available via the module object.
And would be grateful for the same kind of explanation about packages.
By the way, with packages a module name is added into the sys.modules oddly:
>>> import sys
>>> from pypacket import pymodule
>>> "pymodule" in sys.modules.keys()
False
>>> "pypacket" in sys.modules.keys()
True
And also there is a practical question concerning the same matter.
When I build a set of tools which might be used in different processes and programs, I put them in modules. I have no choice but to load a full module even when all I want is to use only one function declared there. As I see it, one can make this problem less painful by making small modules and putting them into a package (if a package doesn't load all of its modules when you import only one of them).
Is there a better way to make such libraries in Python? (With the mere functions, which don't have any dependencies within their module.) Is it possible with C-extensions?
PS sorry for such a long question.
You have a few different questions here. . .
About importing packages
When you import a package, the sequence of steps is the same as when you import a module. The only difference is that the package's code (i.e., the code that creates the "module code object") is the code of the package's __init__.py.
So yes, the sub-modules of the package are not loaded unless the __init__.py does so explicitly. If you do from package import module, only module is loaded, unless of course it imports other modules from the package.
sys.modules names of modules loaded from packages
When you import a module from a package, the name that is added to sys.modules is the "qualified name" that specifies the module name together with the dot-separated names of any packages you imported it from. So if you do from package.subpackage import mod, what is added to sys.modules is "package.subpackage.mod".
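You can check this with any standard-library package (xml.sax.handler here is just a convenient example):
import sys
from xml.sax import handler

print("handler" in sys.modules)          # False: the bare name is never registered
print("xml.sax.handler" in sys.modules)  # True: only the fully qualified name is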
Importing only part of a module
It is usually not a big concern to have to import the whole module instead of just one function. You say it is "painful" but in practice it almost never is.
If, as you say, the functions have no external dependencies, then they are just pure Python and loading them will not take much time. Usually, if importing a module takes a long time, it's because it loads other modules, which means it does have external dependencies and you have to load the whole thing.
If your module has expensive operations that happen on module import (i.e., they are global module-level code and not inside a function), but aren't essential for use of all functions in the module, then you could, if you like, redesign your module to defer that loading until later. That is, if your module does something like:
def simpleFunction():
    pass

# open files, read huge amounts of data, do slow stuff here
you can change it to
def simpleFunction():
    pass

def loadData():
    # open files, read huge amounts of data, do slow stuff here
    pass
and then tell people "call someModule.loadData() when you want to load the data". Or, as you suggested, you could put the expensive parts of the module into their own separate module within a package.
I've never found it to be the case that importing a module caused a meaningful performance impact unless the module was already large enough that it could reasonably be broken down into smaller modules. Making tons of tiny modules that each contain one function is unlikely to gain you anything except maintenance headaches from having to keep track of all those files. Do you actually have a specific situation where this makes a difference for you?
Also, regarding your last point, as far as I'm aware, the same all-or-nothing load strategy applies to C extension modules as for pure Python modules. Obviously, just like with Python modules, you could split things up into smaller extension modules, but you can't do from someExtensionModule import someFunction without also running the rest of the code that was packaged as part of that extension module.
The approximate sequence of steps that occurs when a module is imported is as follows:
Python tries to locate the module in sys.modules and does nothing else if it is found. Packages are keyed by their full name, so while pymodule is missing from sys.modules, pypacket.pymodule will be there (and can be obtained as sys.modules["pypacket.pymodule"]).
Python locates the file that implements the module. If the module is part of the package, as determined by the x.y syntax, it will look for directories named x that contain both an __init__.py and y.py (or further subpackages). The bottom-most file located will be either a .py file, a .pyc file, or a .so/.pyd file. If no file that fits the module is found, an ImportError will be raised.
An empty module object is created, and the code in the module is executed with the module's __dict__ as the execution namespace.1
The module object is placed in sys.modules, and injected into the importer's namespace.
Step 3 is the point at which "objects get loaded into memory": the objects in question are the module object, and the contents of the namespace contained in its __dict__. This dict typically contains top-level functions and classes created as a side effect of executing all the def, class, and other top-level statements normally contained in each module.
Note that the above only describes the default implementation of import. There are a number of ways one can customize import behavior, for example by overriding the __import__ built-in or by implementing import hooks.
1 If the module file is a .py source file, it will be compiled into memory first, and the code objects resulting from the compilation will be executed. If it is a .pyc, the code objects will be obtained by deserializing the file contents. If the module is a .so or a .pyd shared library, it will be loaded using the operating system's shared-library loading facility, and the init<module> C function will be invoked to initialize the module.
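To expand on the note about customizing imports, here is a minimal (and deliberately inert) import hook that only logs what is requested; it relies on the importlib machinery of modern Python 3, so treat it as a sketch rather than part of the answer above:
import importlib.abc
import sys

class LoggingFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        print("import requested:", fullname)
        return None  # returning None lets the normal finders handle the import

sys.meta_path.insert(0, LoggingFinder())

import json  # prints "import requested: json" (unless json was already cached)
import json  # prints nothing: sys.modules is consulted before any finder runs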

What are the advantages and disadvantages of the require vs. import methods of loading code?

Ruby uses require, Python uses import. They're substantially different models, and while I'm more used to the require model, I can see a few places where I think I like import more. I'm curious what things people find particularly easy — or more interestingly, harder than they should be — with each of these models.
In particular, if you were writing a new programming language, how would you design a code-loading mechanism? Which "pros" and "cons" would weigh most heavily on your design choice?
The Python import has a major feature in that it ties two things together -- how to find the import and under what namespace to include it.
This creates very explicit code:
import xml.sax
This specifies where to find the code we want to use, by the rules of the Python search path.
At the same time, all objects that we want to access live under this exact namespace, for example xml.sax.ContentHandler.
I regard this as an advantage over Ruby's require. require 'xml' might in fact make objects inside the namespace XML, or any other namespace, available in the module, without this being directly evident from the require line.
If xml.sax.ContentHandler is too long, you may specify a different name when importing:
import xml.sax as X
And it is now available under X.ContentHandler.
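A quick interpreter check of that aliasing (nothing here is specific to xml.sax; any importable module behaves the same way):
import sys
import xml.sax as X

print(X.ContentHandler)             # the class is reachable under the shorter alias
print(X is sys.modules["xml.sax"])  # True: X is just another name for the same module object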
This way Python requires you to explicitly build the namespace of each module. Python namespaces are thus very "physical", and I'll explain what I mean:
By default, only names directly defined in the module are available in its namespace: functions, classes and so.
To add to a module's namespace, you explicitly import the names you wish to add, placing them (by reference) "physically" in the current module.
For example, if we have the small Python package "process" with internal submodules machine and interface, and we wish to present this as one convenient namespace directly under the package name, this is an example of what we could write in the "package definition" file process/__init__.py:
from process.interface import *
from process.machine import Machine, HelperMachine
Thus we lift what would normally be accessible as process.machine.Machine up to process.Machine. And we add all names from process.interface to the process namespace, in a very explicit fashion.
The advantages of Python's import that I wrote about were simply two:
Clear what you include when using import
Explicit how you modify your own module's namespace (for the program or for others to import)
A nice property of require is that it is actually a method defined in Kernel. Thus you can override it and implement your own packaging system for Ruby, which is what e.g. Rubygems does!
PS: I am not selling monkey patching here, but pointing out that Ruby's package system can be rewritten by the user (even to work like Python's system). When you write a new programming language, you cannot get everything right. Thus if your import mechanism is fully extensible (in every direction) from within the language, you do your future users the best service. A language that is not fully extensible from within itself is an evolutionary dead end. I'd say this is one of the things Matz got right with Ruby.
Python's import provides a very explicit kind of namespace: the namespace is the path, you don't have to look into files to know what namespace they do their definitions in, and your file is not cluttered with namespace definitions. This makes the namespace scheme of an application simple and fast to understand (just look at the source tree), and avoids simple mistakes like mistyping a namespace declaration.
A nice side effect is every file has its own private namespace, so you don't have to worry about conflicts when naming things.
Sometimes namespaces can get annoying too, having things like some.module.far.far.away.TheClass() everywhere can quickly make your code very long and boring to type. In these cases you can import ... from ... and inject bits of another namespace in the current one. If the injection causes a conflict with the module you are importing in, you can simply rename the thing you imported: from some.other.module import Bar as BarFromOtherModule.
Python is still vulnerable to problems like circular imports, but it's the application design more than the language that has to be blamed in these cases.
So python took C++ namespace and #include and largely extended on it. On the other hand I don't see in which way ruby's module and require add anything new to these, and you have the exact same horrible problems like global namespace cluttering.
Disclaimer, I am by no means a Python expert.
The biggest advantage I see to require over import is simply that you don't have to worry about understanding the mapping between namespaces and file paths. It's obvious: it's just a standard file path.
I really like the emphasis on namespacing that import has, but can't help but wonder if this particular approach isn't too inflexible. As far as I can tell, the only means of controlling a module's naming in Python is by altering the filename of the module being imported or using an as rename. Additionally, with explicit namespacing, you have a means by which you can refer to something by its fully-qualified identifier, but with implicit namespacing, you have no means to do this inside the module itself, and that can lead to potential ambiguities that are difficult to resolve without renaming.
i.e., in foo.py:
class Bar:
    def myself(self):
        return foo.Bar
This fails with:
Traceback (most recent call last):
  File "", line 1, in ?
  File "foo.py", line 3, in myself
    return foo.Bar
NameError: global name 'foo' is not defined
Both implementations use a list of locations to search from, which strikes me as a critically important component, regardless of the model you choose.
What if a code-loading mechanism like require was used, but the language simply didn't have a global namespace? i.e., everything, everywhere must be namespaced, but the developer has full control over which namespace the class is defined in, and that namespace declaration occurs explicitly in the code rather than via the filename. Alternatively, defining something in the global namespace generates a warning. Is that a best-of-both-worlds approach, or is there an obvious downside to it that I'm missing?
