Reading the builtin Python modules [duplicate]

Possible Duplicate:
How do I find the location of Python module sources?
I don't understand how to read the code in the built-in Python modules. I know how to find out what's in a module, for example:
import os
dir(os)
But when I try to look, for example, for the function listdir, I cannot find a def listdir to read what it actually does.

One word: inspect.
The inspect module provides several useful functions to help get information about live objects such as modules, classes, methods, functions, tracebacks, frame objects, and code objects. For example, it can help you examine the contents of a class, retrieve the source code of a method, extract and format the argument list for a function, or get all the information you need to display a detailed traceback.
It's in the standard library, and the docs have examples. So, you just print(inspect.getsource(os)), or do inspect.getsourcefile(os), etc.
Note that some of the standard-library modules are written in C (or are even fake modules built into the interpreter), in which case getsourcefile returns nothing, but getfile will at least tell you it's a .so/.pyd/whatever, which you can use to look up the original C source in, say, a copy of the Python source code.
You can also just type help(os), and the FILE right at the top gives you the path (generally the same as getsourcefile for Python modules, the same as getfile otherwise).
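For instance (a quick sketch; math is used here just as an example of a C-implemented module, and whether it is a shared library or compiled into the interpreter depends on your build):
import inspect
import math
import os
# os.py is pure Python, so inspect can show both the file and the source.
print(inspect.getsourcefile(os))        # e.g. /usr/lib/python3.x/os.py
# print(inspect.getsource(os))          # would dump the whole file
# math is implemented in C, so there is no Python source to retrieve.
try:
    print(inspect.getsourcefile(math))  # None if math is a .so/.pyd
    print(inspect.getfile(math))        # path to the shared library
except TypeError as e:
    # On builds where math is compiled into the interpreter itself,
    # there is no file at all.
    print(e)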
And you can always go to the online source for the Python modules and C extension modules. Just change the "2.7" to "3.3", etc., in the URL to get different versions. (I believe if you remove the version entirely, you get the trunk code, currently corresponding to 3.4 pre-alpha, but don't quote me on that.)
The os.listdir function isn't actually defined directly in os; it's effectively from <platform-specific-module> import * imported. You can trace it down through a few steps yourself, but it's usually going to be posix_listdir in posixmodule.c on most platforms. (Even Windows—recent versions use the same file to define the posix module on non-Windows, and the nt and posix modules on Windows, and there's a bunch of #if defined(…) stuff in the code.)
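A quick way to check this for yourself on a Unix system (on Windows the same functions live in the nt module instead):
import os
print(os.name)                      # 'posix' on Linux/macOS, 'nt' on Windows
import posix                        # the C module built from posixmodule.c
print(os.listdir is posix.listdir)  # True: os just re-exports it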

Related

Support for POSIX openat functions in python

There is a patch to add support for the POSIX openat functions (and other *at functions like fstatat) to the python standard library that is marked as closed with resolution fixed, but the os, posix and platform modules do not currently include any of these methods.
These methods are the standard way of solving problems like this in C and other languages efficiently and without race conditions.
Are these included in the standard library currently somewhere? And if not, are there plans to include them in the future?
Yes, this is supported by passing the dir_fd argument to various functions in the standard os module. See for example os.open():
Open the file path and set various flags [...]
This function can support paths relative to directory descriptors with the dir_fd parameter.
If you want to use high-level file objects such as those returned by the builtin open() function, that function's documentation provides example code showing how to do this using the opener parameter to that function. Note that open() and os.open() are entirely different functions and should not be confused. Alternatively, you could open the file with os.open() and then pass the file descriptor number to os.fdopen() or to open().
It should also be noted that this currently only works on Unix; the portable and future-proof way to check for dir_fd support is to write code such as the following:
if os.open in os.supports_dir_fd:
    ...  # Use dir_fd.
else:
    ...  # Don't.
On the other hand, I'm not entirely sure Windows even allows opening a directory in the first place. You certainly can't do it with _open()/_wopen(), which are documented to fail if "the given path is a directory." To be safe, I recommend only trying to open the directory after you check for dir_fd support.
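Putting that together, here is a minimal sketch (assuming a Unix system and a file named data.txt inside /tmp):
import os
dir_fd = os.open("/tmp", os.O_RDONLY)   # a directory file descriptor
try:
    # Low level: os.open() with dir_fd returns a raw file descriptor,
    # which os.fdopen() wraps in a normal file object.
    fd = os.open("data.txt", os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd) as f:
        print(f.read())
    # High level: the opener parameter of the builtin open(), as shown
    # in its documentation.
    def opener(path, flags):
        return os.open(path, flags, dir_fd=dir_fd)
    with open("data.txt", "r", opener=opener) as f:
        print(f.read())
finally:
    os.close(dir_fd)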

Enforcing the order of extension loading

I have two python extensions (dynamic libraries), say a.so and b.so. Of the two, a.so depends on b.so, specifically it uses a type defined in b.so.
In python, I could safely do
import b
import a
# work
But when I do
import a
import b
It imports fine, but when running the code, it reports that the type b.the_type in a is not the b.the_type in b. A close examination with gdb gives me that the PyTypeObject of that type in a.so and b.so have two different addresses (and different refcnt).
My question is how do I enforce the loading order, or make sure that both ways work.
In order to make it possible for people who know shared libraries well but not Python to help me, here's some extra information. In Python extensions, a Python type is essentially a unique global variable that is initialized in its module (the .so file). Types MUST be initialized before they can be used (this is done by a call to a Python API). This required initialization is wrapped in a specific function that has a particular name. Python will call this function when it loads the extension.
My guess is that, since the OS knows that a.so depends on b.so, the system (rather than Python) loads b.so when Python requests only a.so. Yet it is Python's responsibility to call the module initialization function, and Python doesn't know that a depends on b, so the OS loads b without it ever being initialized. On import b, when Python then actually calls the module initialization function, it results in a different PyTypeObject.
If the solution is platform-dependent, my project is currently running on linux (archlinux).
You appear to have linked a to b to import the types b defines. Don't do this.
Instead, import b like you would any other Python module. In other words, the dependency on b should be handled entirely by the Python binary, not by your OS's dynamic library loading structures.
Use the C-API import functions to import b. At that point it should not matter how b is imported; it's just a bunch of Python objects from that point onwards.
That's not to say that b can't produce a C-level API for those objects (NumPy does this too), you just have to make sure that it is Python that loads the extension, not your library. Incidentally, NumPy defines helper functions that do the importing for you, see the import_umath() code generator for an example.

Import statement: Config file Python

I'm maintaining a dictionary that is loaded inside a config file; the dictionary is loaded from a JSON file.
In config.py:
import json
name_dict = json.load(open(dict_file))
I'm importing this config file in several other scripts (file1.py, file2.py, ..., filen.py) using an
import config
statement. My question is: when will the config.py script be executed? I'm sure it won't be executed for every import call made in my multiple scripts. But what exactly happens when an import statement is called?
The top-level code in a module is executed once, the first time you import it. After that, the module object will be found in sys.modules, and the code will not be re-executed to re-generate it.
There are a few exceptions to this:
reload, obviously.
Accidentally importing the same module under two different names (e.g., if the module is in a package, and you've got some directory in the middle of the package in sys.path, you could end up with mypackage.mymodule and mymodule being two copies of the same thing, in which case the code gets run twice).
Installing import hooks/custom importers that replace the standard behavior.
Explicitly monkeying with sys.modules.
Directly calling functions out of imp/importlib or the like.
Certain cases with multiprocessing (and modules that use it indirectly, like concurrent.futures).
For Python 3.1 and later, this is all described in detail under The import system. In particular, look at the Searching section. (The multiprocessing-specific cases are described for that module.)
For earlier versions of Python, you pretty much have to infer the behavior from a variety of different sources and either reading the code or experimenting. However, the well-documented new behavior is intended to work like the old behavior except in specifically described ways, so you can usually get away with reading the 3.x docs even for 2.x.
Note that in general, you don't want to rely on whether top-level code in the module is run once or multiple times. For example, given a top-level function definition, as long as you never compare function objects, or rebind any globals that it (meaning the definition itself, not just the body) depends on, it doesn't make any difference. However, there are some exceptions to that, and loading start-time config files is a perfect example of an exception.
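A small demonstration of the caching (this assumes the config.py above starts with something like print("running config.py") so you can see when its top-level code actually runs):
import sys
import config                             # first import: the print fires
import config                             # already cached: nothing is printed
print("config" in sys.modules)            # True
print(sys.modules["config"] is config)    # True
import importlib
importlib.reload(config)                  # explicitly re-runs the top-level code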

What happens when you import a package?

For efficiency's sake I am trying to figure out how python works with its heap of objects (and system of namespaces, but it is more or less clear). So, basically, I am trying to understand when objects are loaded into the heap, how many of them are there, how long they live etc.
And my question is when I work with a package and import something from it:
from pypackage import pymodule
what objects get loaded into the memory (into the object heap of the python interpreter)? And more generally: what happens? :)
I guess the above example does something like:
some object of the package pypackage was created in memory (which contains some information about the package but not too much), the module pymodule was loaded into memory and a reference to it was created in the local namespace. The important thing here is: no other modules of pypackage (or other objects) were created in memory, unless it is stated explicitly (in the module itself, or somewhere in the package initialization tricks and hooks, which I am not familiar with). In the end, the only big thing in memory is pymodule (i.e. all the objects that were created when the module was imported). Is that so? I would appreciate it if someone clarified this matter a little bit. Maybe you could suggest a useful article about it? (The documentation covers more particular things.)
I have found the following to the same question about the modules import:
When Python imports a module, it first checks the module registry (sys.modules) to see if the module is already imported. If that’s the case, Python uses the existing module object as is.
Otherwise, Python does something like this:
Create a new, empty module object (this is essentially a dictionary)
Insert that module object in the sys.modules dictionary
Load the module code object (if necessary, compile the module first)
Execute the module code object in the new module’s namespace. All variables assigned by the code will be available via the module object.
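A hand-rolled sketch of those four steps (the module name demo_mod and its source string are made up for illustration):
import sys
import types
source = "x = 1\ndef f():\n    return x + 1\n"
mod = types.ModuleType("demo_mod")            # 1. create an empty module object
sys.modules["demo_mod"] = mod                 # 2. insert it into sys.modules
code = compile(source, "<demo_mod>", "exec")  # 3. compile the module code object
exec(code, mod.__dict__)                      # 4. execute it in the module's namespace
import demo_mod                               # found in sys.modules, not re-executed
print(demo_mod.f())                           # prints 2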
And would be grateful for the same kind of explanation about packages.
By the way, with packages a module name is added into the sys.modules oddly:
>>> import sys
>>> from pypacket import pymodule
>>> "pymodule" in sys.modules.keys()
False
>>> "pypacket" in sys.modules.keys()
True
And also there is a practical question concerning the same matter.
When I build a set of tools which might be used in different processes and programs, I put them in modules. I have no choice but to load a full module even when all I want is to use only one function declared there. As I see it, one can make this problem less painful by making small modules and putting them into a package (if a package doesn't load all of its modules when you import only one of them).
Is there a better way to make such libraries in Python? (With the mere functions, which don't have any dependencies within their module.) Is it possible with C-extensions?
PS sorry for such a long question.
You have a few different questions here...
About importing packages
When you import a package, the sequence of steps is the same as when you import a module. The only difference is that the package's code (i.e., the code that creates the "module code object") is the code of the package's __init__.py.
So yes, the sub-modules of the package are not loaded unless the __init__.py does so explicitly. If you do from package import module, only module is loaded, unless of course it imports other modules from the package.
sys.modules names of modules loaded from packages
When you import a module from a package, the name that is added to sys.modules is the "qualified name": the module name together with the dot-separated names of any packages you imported it from. So if you do from package.subpackage import mod, what is added to sys.modules is "package.subpackage.mod".
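For example, with a package/submodule pair from the standard library (xml.dom is used here purely as an illustration):
import sys
from xml import dom
print("dom" in sys.modules)           # False: the bare name is never used
print("xml.dom" in sys.modules)       # True: the qualified name is
print("xml" in sys.modules)           # True: the parent package gets imported too
print(sys.modules["xml.dom"] is dom)  # True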
Importing only part of a module
It is usually not a big concern to have to import the whole module instead of just one function. You say it is "painful" but in practice it almost never is.
If, as you say, the functions have no external dependencies, then they are just pure Python and loading them will not take much time. Usually, if importing a module takes a long time, it's because it loads other modules, which means it does have external dependencies and you have to load the whole thing.
If your module has expensive operations that happen on module import (i.e., they are global module-level code and not inside a function), but aren't essential for use of all functions in the module, then you could, if you like, redesign your module to defer that loading until later. That is, if your module does something like:
def simpleFunction():
    pass

# open files, read huge amounts of data, do slow stuff here
you can change it to
def simpleFunction():
    pass

def loadData():
    # open files, read huge amounts of data, do slow stuff here
    ...
and then tell people "call someModule.loadData() when you want to load the data". Or, as you suggested, you could put the expensive parts of the module into their own separate module within a package.
I've never found it to be the case that importing a module caused a meaningful performance impact unless the module was already large enough that it could reasonably be broken down into smaller modules. Making tons of tiny modules that each contain one function is unlikely to gain you anything except maintenance headaches from having to keep track of all those files. Do you actually have a specific situation where this makes a difference for you?
Also, regarding your last point, as far as I'm aware, the same all-or-nothing load strategy applies to C extension modules as for pure Python modules. Obviously, just like with Python modules, you could split things up into smaller extension modules, but you can't do from someExtensionModule import someFunction without also running the rest of the code that was packaged as part of that extension module.
The approximate sequence of steps that occurs when a module is imported is as follows:
Python tries to locate the module in sys.modules and does nothing else if it is found. Packages are keyed by their full name, so while pymodule is missing from sys.modules, pypacket.pymodule will be there (and can be obtained as sys.modules["pypacket.pymodule"]).
Python locates the file that implements the module. If the module is part of the package, as determined by the x.y syntax, it will look for directories named x that contain both an __init__.py and y.py (or further subpackages). The bottom-most file located will be either a .py file, a .pyc file, or a .so/.pyd file. If no file that fits the module is found, an ImportError will be raised.
An empty module object is created, and the code in the module is executed with the module's __dict__ as the execution namespace.[1]
The module object is placed in sys.modules, and injected into the importer's namespace.
Step 3 is the point at which "objects get loaded into memory": the objects in question are the module object, and the contents of the namespace contained in its __dict__. This dict typically contains top-level functions and classes created as a side effect of executing all the def, class, and other top-level statements normally contained in each module.
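In Python 3 the importlib machinery lets you run roughly these steps by hand, which can make them easier to see (a sketch; json.decoder is used only as an example):
import sys
import importlib.util
# Step 2: locate the module's file without executing it.
spec = importlib.util.find_spec("json.decoder")
print(spec.origin)                    # path to json/decoder.py
# Steps 3 and 4, roughly: create the empty module object, register it,
# then execute the module's code in its own namespace.
module = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = module
spec.loader.exec_module(module)
print(module.JSONDecoder)             # a top-level class created by that code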
Note that the above only describes the default implementation of import. There are a number of ways one can customize import behavior, for example by overriding the __import__ built-in or by implementing import hooks.
[1] If the module file is a .py source file, it will be compiled into memory first, and the code objects resulting from the compilation will be executed. If it is a .pyc, the code objects will be obtained by deserializing the file contents. If the module is a .so or a .pyd shared library, it will be loaded using the operating system's shared-library loading facility, and the init<module> C function will be invoked to initialize the module.

Python docstring search - similar to MATLAB `lookup` or Linux `apropos`

Is there a way to perform keyword searching of module and function docstrings from the interpreter?
Often, when I want to do something in Python, I know there's a module that does what I want, but I don't know what it's called. I would like a way of searching for "the name of the function or module that does X" without having to Google "python do X".
Take the example of "how can I open a URL?". At a Linux shell, I might try >> apropos open url. Under MATLAB, I might try >> lookup open url. Both of these would give me listings of functions or modules that include the words 'open' and 'URL' somewhere in their man page or doc string. For example:
urllib.urlopen : Create a file-like object for the specified URL to read from.
urllib2.urlopen : ...
...
I'd like something that searches through all installed modules, not just the modules that have been imported into my current session.
Yes, Google is a great way to search Python doc strings, but the latency is a bit high. ;)
The built-in support for that comes from pydoc.apropos:
import pydoc
pydoc.apropos('Zip')
# output: zipimport - zipimport provides support for importing Python modules from Zip archives.
Which, as you can see, is nearly useless. It also stops working whenever a module cannot be imported, which might mean 'always' depending on your package management style.
An alternative that I haven't used but looks promising is apropos.py:
Deep, dirty, and exhaustive 'apropos' for Python. Crawls all
libraries in the system.path, and matches a querystring against all
module names, as well as class, function, and method names and
docstrings of the top-level modules.
Usage: ./apropos.py
This module was created due to the limits of PyDoc's apropos method,
which fails hard when one of the top level modules cannot be imported.
The native apropos method also does not crawl docstrings, or deep
module names, which this module does.
Use the following command, which searches the synopsis lines of all available modules for a keyword:
pydoc -k <keyword>
