How do I manipulate the default import path of a Python project?

This is a little complex; I'll try to simplify as much as possible.
I want to create .pyi files for Brython, in a separate, installable, package.
Consider that I have a main project with a Brython file, myfile.py
The file should start with an import exactly like this -- exactly.
from browser import document
In my installable project, I have a structure such as this:
../brython-signatures/src/
└── brython
├── browser
│   ├── document.pyi
│   └── __init__.py
└── __init__.py
I install this second tool (during testing) with the following:
pip install -e ~/Projects/brython-signatures
This forces me to use the following import statement:
from brython.browser import document
For reasons beyond the scope of this question, I do not want to change the structure (e.g. make 'browser' the root package).
Question: Is there a clever way to make my main application think it should import browser instead of brython.browser?
I tried adding a package-dir entry to the installable project:
[package-dir]
browser="src/brython/browser"


Shared class and function code management in Python

I'm new to Python (started about 3 months ago) but have quite a bit of development
experience. I'm trying to figure out a good way to manage a collection of
functions and classes that are shared across several projects. I'm working on
a Windows 10 machine that I don't have admin access to, so the PATH variable is not an option. I'm using VS Code with Python 3.10.
I have several active projects and my current working directory structure is:
python
├── Project A
├── Project B
├── Project C
└── Common
    ├── __init__.py (empty)
    ├── ClassA.py
    ├── ClassB.py
    └── functions.py
I've added a .pth file in AppData/Local/Programs/Python/Python310/Lib/site-packages
which contains the path to the python root folder.
Right now I'm able to use this configuration by importing each file as a separate module:
from Common.ClassA import ClassA
from Common.ClassB import ClassB
from Common import functions as fn
I would like to do something like:
from Common import ClassA, ClassB, functions as fn
Just looking for some experienced advice on how to manage this situation. Thanks to any and
all who have time to respond.
(Disclaimer: I am an admin on my Mac, but none of what I am doing here required sudo permissions.)
One way to do that is to put your common code in a "package", say common, and use pip to do an editable local install via pip install -e common.
After installation, your Python path is modified so that it includes the directory where common lives, and your project-side code can then use it like:
from common.classa import ClassA
Now, writing an installable package used to be non-trivial, but it is TRIVIAL nowadays, and this is likely the more robust approach over modifying the Python path with .pth files (been there, done that myself).
Now, as far as what your imports can look like in your Project A, B, C code: you will find that many packages import their constituent files in their __init__.py.
common/__init__.py:
from .classa import ClassA
from .classb import ClassB
from . import functions
which means ProjectB can use:
import common
a = common.ClassA()
res = common.functions.foobar(42)
You can look at sqlalchemy's __init__.py for that type of approach:
from .engine import AdaptedConnection as AdaptedConnection
from .engine import BaseRow as BaseRow
from .engine import BindTyping as BindTyping
from .engine import ChunkedIteratorResult as ChunkedIteratorResult
from .engine import Compiled as Compiled
from .engine import Connection as Connection
which my own code can then use as:
import sqlalchemy
...
if not isinstance(engine, sqlalchemy.engine.base.Engine):
...
Note: none of this explanation should be taken as detracting from the comments and answers reminding you that Python can put any number of functions and classes into the same .py file. Python is not Java. But in practice, a Python file with over 400-500 lines of code is probably looking for a bit of refactoring. Not least because that facilitates git-based merging if those become relevant. And also because it facilitates code discovery: "Ah, a formatting question. Let's look in formatting.py"
OK, so how much work IS setting up a locally installed package?
TLDR: very little nowadays.
Let's take the package structure and Python files first
(this is under a directory called testpip)
├── common
│   ├── __init__.py
│   └── classa.py
common/__init__.py:
from .classa import A
from pathlib import Path


class B:
    def __repr__(self) -> str:
        return f"{self.__class__.__name__}"

    def info(self):
        pa = Path(__file__)
        print(f"{self} in {pa.relative_to(pa.parent)}")
common/classa.py:
class A:
    def __repr__(self) -> str:
        return f"{self.__class__.__name__}"

    def whoami(self):
        print(f"{self}")
Let's start with just that and try a pip install.
testpip % pip install -e .
Obtaining file:.../testpip
ERROR: file:.../testpip does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
OK, I know setup.py used to be complicated, but what about that pyproject.toml?
There's a write up about a minimal pyproject.toml here.
But that still seemed like a lot of stuff, so I ended up with echo "" > pyproject.toml, i.e. an empty pyproject.toml.
(yes, a touch would do, but the OP is on Windows)
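If you'd rather not rely on an empty file, a minimal explicit pyproject.toml is only a few lines (a sketch, assuming the default setuptools backend; pick whatever name and version you like):
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "common"
version = "0.0.0"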
testpip % pip install -e .
Obtaining file:///.../testpip
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: common
Building editable for common (pyproject.toml) ... done
Created wheel for common: filename=common-0.0.0-0.editable-py3-none-any.whl size=2266 sha256=fe01aa92de3160527136d13a233bfd9ff92da973040981631a4bb8f372adbb0b
Stored in directory: /private/var/folders/bk/_1cwm6dj3h1c0ptrhvr2v7dc0000gs/T/pip-ephem-wheel-cache-l3v8jbfr/wheels/1a/64/41/6ec6e2e75e362f2818c47a49356f82be33b0a6dba83b41354c
Successfully built common
Installing collected packages: common
Successfully installed common-0.0.0
And now, let's go to another directory and try it out.
src/testimporter.py:
from common import A, B
a = A()
a.whoami()
b = B()
b.info()
python testimporter.py:
A
B in __init__.py
The full project structure ended up as:
.
├── common 👈 your Python code
│   ├── __init__.py
│   └── classa.py
├── common.egg-info 👈 generated by pip install -e .
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   └── top_level.txt
├── pyproject.toml 👈 an EMPTY file to make pip install -e work
The easiest way would be to package everything in Common into a single .py file in the same folder as your projects.
The reasoning is that when you do
from Common.ClassA import ClassA
It looks in the Common folder, finds the ClassA file, and imports the ClassA class.
By organizing your directory structure like this:
Project A
Project B
Project C
Common.py
Then you can just run:
from Common import ClassA, ClassB, functions as fn
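For completeness, a sketch of what that single Common.py might contain (the bodies are placeholders). Note that with one module there is no separate functions namespace anymore, so you would import the former functions.py contents directly:
# Common.py -- the former Common/ package merged into one module
class ClassA:
    def whoami(self):
        print(self.__class__.__name__)

class ClassB:
    def whoami(self):
        print(self.__class__.__name__)

# formerly in functions.py
def foobar(x):
    return x * 2
and then, from any project folder: from Common import ClassA, ClassB, foobar.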

Most reliable way to open a file within a Python project [duplicate]

Could you tell me how can I read a file that is inside my Python package?
My situation
A package that I load has a number of templates (text files used as strings) that I want to load from within the program. But how do I specify the path to such file?
Imagine I want to read a file from:
package\templates\temp_file
Some kind of path manipulation? Package base path tracking?
TL;DR: Use the standard library's importlib.resources module, as explained in method 2 below.
The traditional pkg_resources from setuptools is not recommended anymore, because the new method:
is significantly more performant;
is safer, since the use of packages (instead of path-strings) raises compile-time errors;
is more intuitive, because you don't have to "join" paths;
is faster when developing, since you don't need an extra dependency (setuptools) but rely on Python's standard library alone.
I kept the traditional method listed first, to explain the differences with the new method when porting existing code (porting is also explained here).
Let's assume your templates are located in a folder nested inside your module's package:
<your-package>
+--<module-asking-the-file>
+--templates/
   +--temp_file            <-- We want this file.
Note 1: For sure, we should NOT fiddle with the __file__ attribute (e.g. code will break when served from a zip).
Note 2: If you are building this package, remember to declare your data files as package_data or data_files in your setup.py.
1) Using pkg_resources from setuptools (slow)
You may use the pkg_resources package from the setuptools distribution, but that comes with a cost, performance-wise:
import pkg_resources
# Could be any dot-separated package/module name or a "Requirement"
resource_package = __name__
resource_path = '/'.join(('templates', 'temp_file')) # Do not use os.path.join()
template = pkg_resources.resource_string(resource_package, resource_path)
# or for a file-like stream:
template = pkg_resources.resource_stream(resource_package, resource_path)
Tips:
This will read data even if your distribution is zipped, so you may set zip_safe=True in your setup.py, and/or use the long-awaited zipapp packer from python-3.5 to create self-contained distributions.
Remember to add setuptools into your run-time requirements (e.g. in install_requires).
... and notice that according to the Setuptools/pkg_resources docs, you should not use os.path.join:
Basic Resource Access
Note that resource names must be /-separated paths and cannot be absolute (i.e. no leading /) or contain relative names like "..". Do not use os.path routines to manipulate resource paths, as they are not filesystem paths.
2) Python >= 3.7, or using the backported importlib_resources library
Use the standard library's importlib.resources module which is more efficient than setuptools, above:
try:
    import importlib.resources as pkg_resources
except ImportError:
    # Try backported to PY<37 `importlib_resources`.
    import importlib_resources as pkg_resources

from . import templates  # relative-import the *package* containing the templates

template = pkg_resources.read_text(templates, 'temp_file')
# or for a file-like stream:
template = pkg_resources.open_text(templates, 'temp_file')
Attention:
Regarding the function read_text(package, resource):
The package can be either a string or a module.
The resource is NOT a path anymore, but just the filename of the resource to open, within an existing package; it may not contain path separators and it may not have sub-resources (i.e. it cannot be a directory).
For the example asked in the question, we must now:
make the <your_package>/templates/ into a proper package, by creating an empty __init__.py file in it,
so now we can use a simple (possibly relative) import statement (no more parsing package/module names),
and simply ask for resource_name = "temp_file" (no path).
Tips:
To access a file inside the current module, set the package argument to __package__, e.g. pkg_resources.read_text(__package__, 'temp_file') (thanks to #ben-mares).
Things become interesting when an actual filename is asked with path(), since now context-managers are used for temporarily-created files (read this).
Add the backported library, conditionally for older Pythons, with install_requires=["importlib_resources; python_version<'3.7'"] (check this if you package your project with setuptools<36.2.1).
Remember to remove setuptools library from your runtime-requirements, if you migrated from the traditional method.
Remember to customize setup.py or MANIFEST to include any static files.
You may also set zip_safe=True in your setup.py.
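Worth noting: read_text()/open_text() were themselves deprecated later in favor of the files() API. On Python 3.9+ the equivalent of the example above looks like this (a sketch; your_package is a placeholder for your actual package name):
from importlib.resources import files  # Python 3.9+

# Traverse from the package root; with files(), templates/ needs no __init__.py.
template = (files("your_package") / "templates" / "temp_file").read_text()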
A packaging prelude:
Before you can even worry about reading resource files, the first step is to make sure that the data files are getting packaged into your distribution in the first place - it is easy to read them directly from the source tree, but the important part is making sure these resource files are accessible from code within an installed package.
Structure your project like this, putting data files into a subdirectory within the package:
.
├── package
│   ├── __init__.py
│   ├── templates
│   │   └── temp_file
│   ├── mymodule1.py
│   └── mymodule2.py
├── README.rst
├── MANIFEST.in
└── setup.py
You should pass include_package_data=True in the setup() call. The manifest file is only needed if you want to use setuptools/distutils and build source distributions. To make sure the templates/temp_file gets packaged for this example project structure, add a line like this into the manifest file:
recursive-include package *
Historical cruft note: Using a manifest file is not needed for modern build backends such as flit, poetry, which will include the package data files by default. So, if you're using pyproject.toml and you don't have a setup.py file then you can ignore all the stuff about MANIFEST.in.
Now, with packaging out of the way, onto the reading part...
Recommendation:
Use standard library pkgutil APIs. It's going to look like this in library code:
# within package/mymodule1.py, for example
import pkgutil
data = pkgutil.get_data(__name__, "templates/temp_file")
It works in zips. It works on Python 2 and Python 3. It doesn't require third-party dependencies. I'm not really aware of any downsides (if you are, then please comment on the answer).
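One small thing to keep in mind: get_data() returns bytes (or None if the package or resource can't be found), so for text templates you typically decode explicitly:
import pkgutil

data = pkgutil.get_data(__name__, "templates/temp_file")
text = data.decode("utf-8") if data is not None else None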
Bad ways to avoid:
Bad way #1: using relative paths from a source file
This is currently the accepted answer. At best, it looks something like this:
from pathlib import Path
resource_path = Path(__file__).parent / "templates"
data = resource_path.joinpath("temp_file").read_bytes()
What's wrong with that? The assumption that you have files and subdirectories available is not correct. This approach doesn't work if executing code which is packed in a zip or a wheel, and it may be entirely out of the user's control whether or not your package gets extracted to a filesystem at all.
Bad way #2: using pkg_resources APIs
This is described in the top-voted answer. It looks something like this:
from pkg_resources import resource_string
data = resource_string(__name__, "templates/temp_file")
What's wrong with that? It adds a runtime dependency on setuptools, which should preferably be an install time dependency only. Importing and using pkg_resources can become really slow, as the code builds up a working set of all installed packages, even though you were only interested in your own package resources. That's not a big deal at install time (since installation is once-off), but it's ugly at runtime.
Bad way #3: using legacy importlib.resources APIs
This is currently the recommendation in the top-voted answer. It's in the standard library since Python 3.7. It looks like this:
from importlib.resources import read_binary
data = read_binary("package.templates", "temp_file")
What's wrong with that? Well, unfortunately, the implementation left some things to be desired, and it was deprecated in Python 3.11. Using importlib.resources.read_binary, importlib.resources.read_text and friends will require you to add an empty file templates/__init__.py so that data files reside within a sub-package rather than in a subdirectory. It will also expose the package/templates subdirectory as an importable package.templates sub-package in its own right. This won't work with many existing packages which are already published using resource subdirectories instead of resource sub-packages, and it's inconvenient to add the __init__.py files everywhere, muddying the boundary between data and code.
This approach was deprecated in upstream importlib_resources in 2021, and was deprecated in the stdlib from Python 3.11. bpo-45514 tracked the deprecation, and the "migrating from legacy" guide offers _legacy.py wrappers to aid with the transition.
Honorable mention: using newer importlib_resources APIs
This has not been mentioned in any other answers yet, but importlib_resources is more than a simple backport of the Python 3.7+ importlib.resources code. It has traversable APIs which you can use like this:
import importlib_resources
my_resources = importlib_resources.files("package")
data = (my_resources / "templates" / "temp_file").read_bytes()
This works on Python 2 and 3, it works in zips, and it doesn't require spurious __init__.py files to be added in resource subdirectories. The only downside vs pkgutil that I can see is that these new APIs are only available in the stdlib for Python-3.9+, so there is still a third-party dependency needed to support older Python versions. If you only need to run on Python-3.9+ then use this approach, or you can add a compatibility layer and a conditional dependency on the backport for older Python versions:
# in your library code:
try:
    from importlib.resources import files
except ImportError:
    from importlib_resources import files

# in your setup.py or similar:
from setuptools import setup
setup(
    ...
    install_requires=[
        'importlib_resources; python_version < "3.9"',
    ]
)
Example project:
I've created an example project on GitHub and uploaded it to PyPI, which demonstrates all five approaches discussed above. Try it out with:
$ pip install resources-example
$ resources-example
See https://github.com/wimglenn/resources-example for more info.
The section "10.8. Reading Datafiles Within a Package" of the Python Cookbook, Third Edition by David Beazley and Brian K. Jones gives the answer.
I'll quote it here:
Suppose you have a package with files organized as follows:
mypackage/
    __init__.py
    somedata.dat
    spam.py
Now suppose the file spam.py wants to read the contents of the file somedata.dat. To do
it, use the following code:
import pkgutil
data = pkgutil.get_data(__package__, 'somedata.dat')
The resulting variable data will be a byte string containing the raw contents of the file.
The first argument to get_data() is a string containing the package name. You can
either supply it directly or use a special variable, such as __package__. The second
argument is the relative name of the file within the package. If necessary, you can navigate
into different directories using standard Unix filename conventions as long as the
final directory is still located within the package.
In this way, the package can be installed as a directory, a .zip, or an .egg.
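For example, navigating into a subdirectory uses forward slashes relative to the package root (the templates/ directory here is just an illustrative name):
import pkgutil

# spam.py reading a data file from a subdirectory of mypackage
data = pkgutil.get_data(__package__, 'templates/somedata.dat')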
In case you have this structure
lidtk
├── bin
│   └── lidtk
├── lidtk
│   ├── analysis
│   │   ├── char_distribution.py
│   │   └── create_cm.py
│   ├── classifiers
│   │   ├── char_dist_metric_train_test.py
│   │   ├── char_features.py
│   │   ├── cld2
│   │   │   ├── cld2_preds.txt
│   │   │   └── cld2wili.py
│   │   ├── get_cld2.py
│   │   ├── text_cat
│   │   │   ├── __init__.py
│   │   │   ├── README.md <---------- say you want to get this
│   │   │   └── textcat_ngram.py
│   │   └── tfidf_features.py
│   ├── data
│   │   ├── __init__.py
│   │   ├── create_ml_dataset.py
│   │   ├── download_documents.py
│   │   ├── language_utils.py
│   │   ├── pickle_to_txt.py
│   │   └── wili.py
│   ├── __init__.py
│   ├── get_predictions.py
│   ├── languages.csv
│   └── utils.py
├── README.md
├── setup.cfg
└── setup.py
you need this code:
import pkg_resources
# __name__ in case you're within the package
# - otherwise it would be 'lidtk' in this example as it is the package name
path = 'classifiers/text_cat/README.md' # always use slash
filepath = pkg_resources.resource_filename(__name__, path)
The strange "always use slash" part comes from the setuptools APIs:
Also notice that if you use paths, you must use a forward slash (/) as the path separator, even if you are on Windows. Setuptools automatically converts slashes to appropriate platform-specific separators at build time
In case you wonder where the documentation is:
PEP 0365
https://packaging.python.org/guides/single-sourcing-package-version/
The accepted answer should be to use importlib.resources. pkgutil.get_data also requires that the package argument be a non-namespace package (see the pkgutil docs). Hence, the directory containing the resource must have an __init__.py file, giving it the exact same limitations as importlib.resources. If the overhead of pkg_resources is not a concern, this is also an acceptable alternative.
Pre-Python-3.3, all packages were required to have an __init__.py. Post-Python-3.3, a folder doesn't need an __init__.py to be a package. This is called a namespace package. Unfortunately, pkgutil does not work with namespace packages (see pkgutil docs).
For example, with the package structure:
+-- foo/
| +-- __init__.py
| +-- bar/
| | +-- hi.txt
where hi.txt just has Hi!, you get the following
>>> import pkgutil
>>> rsrc = pkgutil.get_data("foo.bar", "hi.txt")
>>> print(rsrc)
None
However, with an __init__.py in bar, you get
>>> import pkgutil
>>> rsrc = pkgutil.get_data("foo.bar", "hi.txt")
>>> print(rsrc)
b'Hi!'
Assuming you are using an egg file that is not extracted:
I "solved" this in a recent project by using a post-install script that extracts my templates from the egg (zip file) to the proper directory in the filesystem. It was the quickest, most reliable solution I found, since working with __path__[0] can go wrong sometimes (I don't recall the name, but I came across at least one library that added something in front of that list!).
Also, egg files are usually extracted on the fly to a temporary location called the "egg cache". You can change that location using an environment variable, either before starting your script or even later, e.g.:
os.environ['PYTHON_EGG_CACHE'] = path
However, there is also pkg_resources, which might do the job properly.

Is it good practice to design a Python package so that it does not import all its modules at once?

I have developed a fairly heavy Python application that has many dependencies. It takes a couple of seconds to be imported, and I was thinking in ways to reduce this load time.
The current structure is something like this:
my_package
├── __init__.py
├── cli.py
├── config.py
├── utils.py
├── subpackage_a
│   ├── __init__.py
│   └── module_a1.py
└── subpackage_b
    ├── __init__.py
    └── module_b1.py
My idea is that cli.py implements the command line interface to the application, whereas config.py stores global variables with configuration of the application. The logic of the application is in the common utils.py, as well as the two subpackages, which group classes with similar functionality.
Right now, when you import my_package, __init__.py basically imports the rest of the modules. Similarly, the __init__.py in each subpackage imports all classes to the corresponding subpackage namespace. The result is that importing my_package gives you a namespace with access to all the API via . notation.
Is this a good design? A small problem I'm having is that the application actually performs 2 different tasks (mostly related either to subpackage_a or to subpackage_b; I never need both in the same execution). I was thinking that a better design could be to not import the subpackages. The main __init__.py would just load the configuration and that's all. Then cli.py (which parses the arguments given to the application) would take care of importing either subpackage_a or subpackage_b, depending on the intended use of the application. I think this would reduce import times, as well as memory usage, because a good part of the application is not loaded and was not going to be used anyway.
This would, however, go against the Python mantra of importing all modules at the top of the file, not in the middle of the code (after parsing the command line). Would this new approach be considered "bad", even if it has desirable properties? What could possibly go wrong?
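For concreteness, a sketch of what that deferred import could look like in cli.py (assuming each subpackage exposes some run() entry point; the task names are placeholders):
# cli.py -- import the heavy subpackage only after parsing arguments
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("task", choices=["a", "b"])
    args = parser.parse_args()

    # Only one subpackage is ever imported per execution.
    if args.task == "a":
        from my_package import subpackage_a as impl
    else:
        from my_package import subpackage_b as impl
    impl.run()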

What is a very *simple* way to structure a python project?

So I have this python thing that needs to process a file.
First it was:
my_project/
├── script.py
And I would simply run it with python script.py file.csv.
Then it grew and became:
my_project/
├── script.py
├── util/
│ └── string_util.py
├── services/
│ └── my_service.py
(There is an empty __init__.py in every directory.)
But now my_service.py would like to use string_util.py, and it's not at all straightforward how to do this nicely.
I would like to do from ..util import string_util in my_service.py (which is imported into script.py with from services import my_service), but that does not work with python script.py, since my_service's __name__ is then only services.my_service (and I get Attempted relative import beyond toplevel package).
I can do cd .. and python -m my_project.script, but that seems so unnatural and would be really bad to put in the README as the instructions for how to run this.
Right now I'm solving it with the ugly sys.path.append() hack.
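(For context, that hack usually looks something like this at the top of my_service.py; a sketch, not a recommendation:)
# my_service.py -- manually put the project root on sys.path
import os
import sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from util import string_util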
What other options do I have?
This is bordering on opinion, but I'll share my take on this.
You should look at your project a different way. Choose one execution point, and reference your imports from there, to avoid all of the odd relative imports you are trying to work around. So, looking at your project structure:
my_project/
├── script.py
├── util/
│ └── string_util.py
├── services/
│ └── my_service.py
As you are currently doing, execute your code from within my_project. That way all your imports should be with respect to that point. Therefore, your imports actually look like this:
# my_service.py
from util.string_util import foo
Another way to think about this is that if you are moving your project around, or have a CI system, you need to specify which project root you want to execute from. Keeping that in mind, and choosing the single execution point from which your project should be run, will make your life much easier when it comes to structuring your packages and modules and referencing them appropriately, and it allows other systems to use your project without dealing with odd relative imports.
Hope this helps.
