including python file from project root in setup.py build - python

I am trying to include a python file in the build/lib directory created when running
python setup.py install
In particular, I would like to include a simple configuration file ('definitions.py') that defines a ROOT_DIR variable, which is then used by subpackages. The 'definitions.py' file contains:
import os
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
My goal is to have configuration files within each subpackage ('config.py') call ROOT_DIR to build their own absolute paths:
from definitions import ROOT_DIR
PACKAGE_DIR = os.path.join(ROOT_DIR, 'package1/')
The idea is drawn from this stackoverflow answer: https://stackoverflow.com/a/25389715.
However, this 'definitions.py' file never shows up in the build directory when running 'setup.py install'.
Here is the directory structure of the project:
project
|
├── setup.py
|
├── definitions.py
|
├── package1
| ├── __init__.py
| ├── config.py
| └── ...
|
├── package2
| ├── __init__.py
| └── ...
└── ...
My multiple attempts have failed (trying, e.g. the suggestions offered in https://stackoverflow.com/a/11848281). As far as I can tell, it's because definitions.py is in the top-level of my project structure (which lacks an __init__.py file).
I have tried:
1) ...using the 'package_data' argument in setuptools.setup()
package_data={'package': ['./definitions.py']}
but definitions.py does not show up in the build (I think because definitions.py is not in a 'package' that has an __init__.py?).
2) ...using a MANIFEST.in file, but this also does not work (I think because MANIFEST does not work with .py files?)
My question:
Is there a way to include definitions.py in the build directory? Or, is there a better way to provide access to absolute paths built from the top-level directory for multiple sub-packages?

If you are looking for a way to access a non-Python data file in the installed module, as in the question you've linked (a configuration file in the top-level package that should be accessible in subpackages), use the pkg_resources machinery instead of inventing a custom path resolution. An example project structure:
project
├── setup.py
└── root
├── __init__.py
├── config.txt
├── sub1
│ └── __init__.py
└── sub2
└── __init__.py
setup.py:
from setuptools import setup
setup(
name='myproj',
...,
packages=['root', 'root.sub1', 'root.sub2'], # or setuptools.find_packages()
package_data={'root': ['config.txt']}
)
Update:
As pointed out by wim in the comments, there's now a backport for importlib.resources (which is only available in Python 3.7 and onwards) - importlib_resources, which offers a modern resource machinery that utilizes pathlib:
# access the filepath
importlib_resources.path('root', 'config.txt')
# access the contents as string
importlib_resources.read_text('root', 'config.txt')
# access the contents as file-like object
importlib_resources.open_binary('root', 'config.txt')
Original answer
Using pkg_resources, you can access the root/config.txt from any spot of your package without having to perform any path resolution at all:
import pkg_resources
# access the filepath:
filepath = pkg_resources.resource_filename('root', 'config.txt')
# access the contents as string:
contents = pkg_resources.resource_string('root', 'config.txt')
# access the contents as file-like object:
contents = pkg_resources.resource_stream('root', 'config.txt')
etc.
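The same lookup can be exercised end-to-end with the standard library. In this sketch the root package and config.txt are created on the fly in a temporary directory so the snippet runs anywhere; in real code they would be part of your installed package:

```python
import os
import sys
import tempfile

# Stand-in for the installed 'root' package from the answer above,
# built on the fly so the snippet is self-contained.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'root'))
open(os.path.join(tmp, 'root', '__init__.py'), 'w').close()
with open(os.path.join(tmp, 'root', 'config.txt'), 'w') as f:
    f.write('setting = 1')
sys.path.insert(0, tmp)

# The stdlib equivalent of the importlib_resources calls shown above
# (Python 3.7+); 'root' is located via the import system, not the cwd.
import importlib.resources as resources
contents = resources.read_text('root', 'config.txt')
print(contents)  # -> setting = 1
```

Because the resource is resolved through the import system rather than the filesystem, the same call keeps working when the package is installed as a zip or wheel.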

Related

most reliable way to open file within python project [duplicate]

Could you tell me how can I read a file that is inside my Python package?
My situation
A package that I load has a number of templates (text files used as strings) that I want to load from within the program. But how do I specify the path to such file?
Imagine I want to read a file from:
package\templates\temp_file
Some kind of path manipulation? Package base path tracking?
TL;DR: Use the standard library's importlib.resources module, as explained in method no. 2 below.
The traditional pkg_resources from setuptools is no longer recommended, because the new method:
is significantly more performant;
is safer, since the use of packages (instead of path strings) raises compile-time errors;
is more intuitive, because you don't have to "join" paths;
is lighter when developing, since you don't need an extra dependency (setuptools) but rely on Python's standard library alone.
I kept the traditional method listed first, to explain the differences with the new method when porting existing code (porting is also explained here).
Let's assume your templates are located in a folder nested inside your module's package:
<your-package>
+--<module-asking-the-file>
+--templates/
+--temp_file <-- We want this file.
Note 1: For sure, we should NOT fiddle with the __file__ attribute (e.g. code will break when served from a zip).
Note 2: If you are building this package, remember to declare your data files as package_data or data_files in your setup.py.
1) Using pkg_resources from setuptools (slow)
You may use pkg_resources package from setuptools distribution, but that comes with a cost, performance-wise:
import pkg_resources
# Could be any dot-separated package/module name or a "Requirement"
resource_package = __name__
resource_path = '/'.join(('templates', 'temp_file')) # Do not use os.path.join()
template = pkg_resources.resource_string(resource_package, resource_path)
# or for a file-like stream:
template = pkg_resources.resource_stream(resource_package, resource_path)
Tips:
This will read data even if your distribution is zipped, so you may set zip_safe=True in your setup.py, and/or use the long-awaited zipapp packer from python-3.5 to create self-contained distributions.
Remember to add setuptools into your run-time requirements (e.g. in install_requires).
... and notice that according to the Setuptools/pkg_resources docs, you should not use os.path.join:
Basic Resource Access
Note that resource names must be /-separated paths and cannot be absolute (i.e. no leading /) or contain relative names like "..". Do not use os.path routines to manipulate resource paths, as they are not filesystem paths.
2) Python >= 3.7, or using the backported importlib_resources library
Use the standard library's importlib.resources module which is more efficient than setuptools, above:
try:
import importlib.resources as pkg_resources
except ImportError:
# Try backported to PY<37 `importlib_resources`.
import importlib_resources as pkg_resources
from . import templates # relative-import the *package* containing the templates
template = pkg_resources.read_text(templates, 'temp_file')
# or for a file-like stream:
template = pkg_resources.open_text(templates, 'temp_file')
Attention:
Regarding the function read_text(package, resource):
The package can be either a string or a module.
The resource is NOT a path anymore, but just the filename of the resource to open, within an existing package; it may not contain path separators and it may not have sub-resources (i.e. it cannot be a directory).
For the example asked in the question, we must now:
make the <your_package>/templates/ into a proper package, by creating an empty __init__.py file in it,
so now we can use a simple (possibly relative) import statement (no more parsing package/module names),
and simply ask for resource_name = "temp_file" (no path).
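The steps above can be sketched as a self-contained demo. The mypkg name and template contents are invented for illustration, and the package is created on the fly in a temporary directory so the snippet runs anywhere:

```python
import importlib
import os
import sys
import tempfile

# Build <your_package>/templates/ as a *proper package* on the fly:
# note the __init__.py inside templates/, as this method requires.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'mypkg', 'templates'))
open(os.path.join(tmp, 'mypkg', '__init__.py'), 'w').close()
open(os.path.join(tmp, 'mypkg', 'templates', '__init__.py'), 'w').close()
with open(os.path.join(tmp, 'mypkg', 'templates', 'temp_file'), 'w') as f:
    f.write('Hello, template!')
sys.path.insert(0, tmp)

import importlib.resources as pkg_resources
# The package argument may be a module object (a 'mypkg.templates' string
# works too); the resource is a bare filename with no path separators.
templates = importlib.import_module('mypkg.templates')
template = pkg_resources.read_text(templates, 'temp_file')
print(template)  # -> Hello, template!
```

In real code the import would simply be `from . import templates`, as shown earlier; `import_module` is used here only because the package is synthesized at runtime.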
Tips:
To access a file inside the current module, set the package argument to __package__, e.g. pkg_resources.read_text(__package__, 'temp_file') (thanks to @ben-mares).
Things become interesting when an actual filename is asked with path(), since now context-managers are used for temporarily-created files (read this).
Add the backported library, conditionally for older Pythons, with install_requires=[" importlib_resources ; python_version<'3.7'"] (check this if you package your project with setuptools<36.2.1).
Remember to remove setuptools library from your runtime-requirements, if you migrated from the traditional method.
Remember to customize setup.py or MANIFEST to include any static files.
You may also set zip_safe=True in your setup.py.
A packaging prelude:
Before you can even worry about reading resource files, the first step is to make sure that the data files are getting packaged into your distribution in the first place - it is easy to read them directly from the source tree, but the important part is making sure these resource files are accessible from code within an installed package.
Structure your project like this, putting data files into a subdirectory within the package:
.
├── package
│   ├── __init__.py
│   ├── templates
│   │   └── temp_file
│   ├── mymodule1.py
│   └── mymodule2.py
├── README.rst
├── MANIFEST.in
└── setup.py
You should pass include_package_data=True in the setup() call. The manifest file is only needed if you want to use setuptools/distutils and build source distributions. To make sure the templates/temp_file gets packaged for this example project structure, add a line like this into the manifest file:
recursive-include package *
Historical cruft note: Using a manifest file is not needed for modern build backends such as flit, poetry, which will include the package data files by default. So, if you're using pyproject.toml and you don't have a setup.py file then you can ignore all the stuff about MANIFEST.in.
Now, with packaging out of the way, onto the reading part...
Recommendation:
Use standard library pkgutil APIs. It's going to look like this in library code:
# within package/mymodule1.py, for example
import pkgutil
data = pkgutil.get_data(__name__, "templates/temp_file")
It works in zips. It works on Python 2 and Python 3. It doesn't require third-party dependencies. I'm not really aware of any downsides (if you are, then please comment on the answer).
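A self-contained sketch of this recommendation (the package and file names here are invented, and the package is built on the fly in a temporary directory so the snippet runs anywhere):

```python
import os
import pkgutil
import sys
import tempfile

# A throwaway package for demonstration; note that templates/ needs
# no __init__.py for pkgutil -- it is just a resource subdirectory.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'package', 'templates'))
open(os.path.join(tmp, 'package', '__init__.py'), 'w').close()
with open(os.path.join(tmp, 'package', 'templates', 'temp_file'), 'wb') as f:
    f.write(b'some data')
sys.path.insert(0, tmp)

# get_data returns the raw bytes, whether the package lives on the
# filesystem or inside a zip.
data = pkgutil.get_data('package', 'templates/temp_file')
print(data)  # -> b'some data'
```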
Bad ways to avoid:
Bad way #1: using relative paths from a source file
This is currently the accepted answer. At best, it looks something like this:
from pathlib import Path
resource_path = Path(__file__).parent / "templates"
data = resource_path.joinpath("temp_file").read_bytes()
What's wrong with that? The assumption that you have files and subdirectories available is not correct. This approach doesn't work if executing code which is packed in a zip or a wheel, and it may be entirely out of the user's control whether or not your package gets extracted to a filesystem at all.
Bad way #2: using pkg_resources APIs
This is described in the top-voted answer. It looks something like this:
from pkg_resources import resource_string
data = resource_string(__name__, "templates/temp_file")
What's wrong with that? It adds a runtime dependency on setuptools, which should preferably be an install time dependency only. Importing and using pkg_resources can become really slow, as the code builds up a working set of all installed packages, even though you were only interested in your own package resources. That's not a big deal at install time (since installation is once-off), but it's ugly at runtime.
Bad way #3: using legacy importlib.resources APIs
This is currently the recommendation in the top-voted answer. It's in the standard library since Python 3.7. It looks like this:
from importlib.resources import read_binary
data = read_binary("package.templates", "temp_file")
What's wrong with that? Well, unfortunately, the implementation left some things to be desired, and it was deprecated in Python 3.11. Using importlib.resources.read_binary, importlib.resources.read_text and friends will require you to add an empty file templates/__init__.py so that data files reside within a sub-package rather than in a subdirectory. It will also expose the package/templates subdirectory as an importable package.templates sub-package in its own right. This won't work with many existing packages which are already published using resource subdirectories instead of resource sub-packages, and it's inconvenient to add the __init__.py files everywhere, muddying the boundary between data and code.
This approach was deprecated in upstream importlib_resources in 2021, and in the stdlib from Python 3.11. bpo-45514 tracked the deprecation, and the "Migrating from Legacy" guide describes the _legacy.py wrappers provided to aid the transition.
Honorable mention: using newer importlib_resources APIs
This has not been mentioned in any other answers yet, but importlib_resources is more than a simple backport of the Python 3.7+ importlib.resources code. It has traversable APIs which you can use like this:
import importlib_resources
my_resources = importlib_resources.files("package")
data = (my_resources / "templates" / "temp_file").read_bytes()
This works on Python 2 and 3, it works in zips, and it doesn't require spurious __init__.py files to be added in resource subdirectories. The only downside vs pkgutil that I can see is that these new APIs are only available in the stdlib for Python-3.9+, so there is still a third-party dependency needed to support older Python versions. If you only need to run on Python-3.9+ then use this approach, or you can add a compatibility layer and a conditional dependency on the backport for older Python versions:
# in your library code:
try:
from importlib.resources import files
except ImportError:
from importlib_resources import files
# in your setup.py or similar:
from setuptools import setup
setup(
...
install_requires=[
'importlib_resources; python_version < "3.9"',
]
)
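For completeness, here is the traversable API in action as a runnable sketch, using the stdlib importlib.resources.files (Python 3.9+) rather than the backport; the package is again synthesized in a temporary directory for illustration:

```python
import os
import sys
import tempfile

# Throwaway package for demonstration; as promised, no spurious
# __init__.py is needed inside templates/.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'package', 'templates'))
open(os.path.join(tmp, 'package', '__init__.py'), 'w').close()
with open(os.path.join(tmp, 'package', 'templates', 'temp_file'), 'wb') as f:
    f.write(b'some data')
sys.path.insert(0, tmp)

# files() returns a pathlib-like "traversable" object: join segments
# with / and read directly, even when the package is inside a zip.
from importlib.resources import files
data = (files('package') / 'templates' / 'temp_file').read_bytes()
print(data)  # -> b'some data'
```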
Example project:
I've created an example project on GitHub and uploaded it to PyPI, which demonstrates all five approaches discussed above. Try it out with:
$ pip install resources-example
$ resources-example
See https://github.com/wimglenn/resources-example for more info.
The section "10.8. Reading Datafiles Within a Package" of Python Cookbook, Third Edition by David Beazley and Brian K. Jones gives the answer.
I'll reproduce it here:
Suppose you have a package with files organized as follows:
mypackage/
__init__.py
somedata.dat
spam.py
Now suppose the file spam.py wants to read the contents of the file somedata.dat. To do
it, use the following code:
import pkgutil
data = pkgutil.get_data(__package__, 'somedata.dat')
The resulting variable data will be a byte string containing the raw contents of the file.
The first argument to get_data() is a string containing the package name. You can
either supply it directly or use a special variable, such as __package__. The second
argument is the relative name of the file within the package. If necessary, you can navigate
into different directories using standard Unix filename conventions as long as the
final directory is still located within the package.
In this way, the package can be installed as a directory, .zip, or .egg.
In case you have this structure
lidtk
├── bin
│   └── lidtk
├── lidtk
│   ├── analysis
│   │   ├── char_distribution.py
│   │   └── create_cm.py
│   ├── classifiers
│   │   ├── char_dist_metric_train_test.py
│   │   ├── char_features.py
│   │   ├── cld2
│   │   │   ├── cld2_preds.txt
│   │   │   └── cld2wili.py
│   │   ├── get_cld2.py
│   │   ├── text_cat
│   │   │   ├── __init__.py
│   │   │   ├── README.md <---------- say you want to get this
│   │   │   └── textcat_ngram.py
│   │   └── tfidf_features.py
│   ├── data
│   │   ├── __init__.py
│   │   ├── create_ml_dataset.py
│   │   ├── download_documents.py
│   │   ├── language_utils.py
│   │   ├── pickle_to_txt.py
│   │   └── wili.py
│   ├── __init__.py
│   ├── get_predictions.py
│   ├── languages.csv
│   └── utils.py
├── README.md
├── setup.cfg
└── setup.py
you need this code:
import pkg_resources
# __name__ in case you're within the package
# - otherwise it would be 'lidtk' in this example as it is the package name
path = 'classifiers/text_cat/README.md' # always use slash
filepath = pkg_resources.resource_filename(__name__, path)
The strange "always use slash" requirement comes from the setuptools APIs:
Also notice that if you use paths, you must use a forward slash (/) as the path separator, even if you are on Windows. Setuptools automatically converts slashes to appropriate platform-specific separators at build time
In case you wonder where the documentation is:
PEP 0365
https://packaging.python.org/guides/single-sourcing-package-version/
The accepted answer should be to use importlib.resources. pkgutil.get_data also requires its package argument to refer to a non-namespace package (see the pkgutil docs). Hence, the directory containing the resource must have an __init__.py file, giving it the exact same limitations as importlib.resources. If the overhead of pkg_resources is not a concern, it is also an acceptable alternative.
Pre-Python-3.3, all packages were required to have an __init__.py. Post-Python-3.3, a folder doesn't need an __init__.py to be a package. This is called a namespace package. Unfortunately, pkgutil does not work with namespace packages (see pkgutil docs).
For example, with the package structure:
+-- foo/
| +-- __init__.py
| +-- bar/
| | +-- hi.txt
where hi.txt just has Hi!, you get the following
>>> import pkgutil
>>> rsrc = pkgutil.get_data("foo.bar", "hi.txt")
>>> print(rsrc)
None
However, with an __init__.py in bar, you get
>>> import pkgutil
>>> rsrc = pkgutil.get_data("foo.bar", "hi.txt")
>>> print(rsrc)
b'Hi!'
Assuming you are using an egg file that is not extracted:
I "solved" this in a recent project by using a postinstall script that extracts my templates from the egg (a zip file) to the proper directory in the filesystem. It was the quickest, most reliable solution I found, since working with __path__[0] can go wrong sometimes (I don't recall the name, but I came across at least one library that added something in front of that list!).
Also egg files are usually extracted on the fly to a temporary location called the "egg cache". You can change that location using an environment variable, either before starting your script or even later, eg.
os.environ['PYTHON_EGG_CACHE'] = path
However there is pkg_resources that might do the job properly.

Best directory structure for a repository with several python entry points and internal dependencies?

I'm working on a project with the following directory structure:
project/
package1/
module1.py
module2.py
package2/
module1.py
module2.py
main1.py
main2.py
main3.py
...
mainN.py
where each mainX.py file is an executable Python script that imports modules from either package1, package2, or both. package1 and package2 are subpackages meant to be distributed along with the rest of the project (not independently).
The standard thing to do is to put your entry point in the top-level directory. I have N entry points, so I put them all in the top-level directory. The trouble is that N keeps growing, so my top-level directory is getting flooded with entry points.
I could move the mainX.py files to a sub-directory (say, project/run), but then all of the package1 and package2 imports would break. I could extract package1 and package2 to a separate repository and just expect it to be installed on the system (i.e., in the system / user python path), but that would complicate installation. I could modify the Python path as a precondition or during runtime, but that's messy and could introduce unintended consequences. I could write a single main.py entry point script with argument subparsers respectively pointing to run/main1.py, ..., run/mainN.py, but that would introduce coupling between main.py and each of the run/mainX.py files.
What's the standard, "Pythonic" solution to this issue?
The standard solution is to use console_scripts packaging for your entry points - read about the entry-points specification here. This feature can be used to generate script wrappers like main1.py ... mainN.py at installation time.
Since these script wrappers are generated code, they do not exist in the project source directory at all, so that problem of clutter ("top-level directory is getting flooded with entry points") goes away.
The actual code for the scripts will be defined somewhere within the package, and the places where the main*.py scripts will actually hook into code within the package is defined in the package metadata. You can hook a console script entry-point up to any callable within the package, provided it can be called without arguments (optional arguments, i.e. args with default values, are fine).
project
├── package1
│   ├── __init__.py
│   ├── module1.py
│   └── module2.py
├── package2
│   ├── __init__.py
│   ├── module1.py
│   └── module2.py
├── pyproject.toml
└── scripts
└── __init__.py
This is the new directory structure. Note the addition of __init__.py files, which indicate that package1 and package2 are packages and not just subdirectories.
For the new files added, here's the scripts/__init__.py:
# these imports should work
# from package1 import ...
# from package2.module1 import ...
def myscript1():
# put whatever main1.py did here
print("hello")
def myscript2():
# put whatever main2.py did here
print("world")
These don't need to be all in the same file, and you can put them wherever you want within the package actually, as long as you update the hooks in the [project.scripts] section of the packaging definition.
And here's that packaging definition:
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[project]
name = "mypackage"
version = "0.0.1"
[project.scripts]
"main1.py" = "scripts:myscript1"
"main2.py" = "scripts:myscript2"
[tool.setuptools]
packages = ["package1", "package2", "scripts"]
Now when the package is installed, the console scripts are generated:
$ pip install --editable .
...
Successfully installed mypackage-0.0.1
$ main1.py
hello
$ main2.py
world
As mentioned, those executables do not live in the project directory, but within the site's scripts directory, which will be present on $PATH. The scripts are generated by pip, using vendored code from distlib's ScriptMaker. If you peek at the generated script files you'll see that they're simple wrappers, they'll just import the callable from within the package and then call it. Any argument parsing, logging configuration, etc must all still be handled within the package code.
$ ls
mypackage.egg-info package1 package2 pyproject.toml scripts
$ which main2.py
/tmp/project/.venv/bin/main2.py
The exact location of the scripts directory depends on your platform, but it can be checked like this in Python:
>>> import sysconfig
>>> sysconfig.get_path("scripts")
'/tmp/project/.venv/bin'
A solution for you is to gather the entry points into an additional package, but run them as modules rather than directly by file:
project/
package1/
module1.py
module2.py
package2/
module1.py
module2.py
run/
main1.py
main2.py
main3.py
...
mainN.py
python -m run.main3
This way your current directory (hopefully the project root) will still be the one prepended to sys.path instead of the directory containing the scripts.
More canonical solutions would include
configuring export PYTHONPATH=path/to/your/project
writing a path/to/your/project line in a foobar.pth file inside the site-packages folder of your virtualenv
using a single entrypoint that features subcommands, e.g. with https://click.palletsprojects.com/en/latest/api/#click.Group
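That last option can be sketched without any dependency, using argparse subcommands in place of click.Group (the command names main1/main2 are placeholders for your real entry points):

```python
# main.py - one entry point dispatching to subcommands (argparse sketch,
# standing in for the click.Group approach; command names are placeholders)
import argparse

def main1(args):
    # put whatever main1.py did here
    print("running main1")

def main2(args):
    # put whatever main2.py did here
    print("running main2")

def main(argv=None):
    parser = argparse.ArgumentParser(prog="project")
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("main1", help="what main1.py did").set_defaults(func=main1)
    subparsers.add_parser("main2", help="what main2.py did").set_defaults(func=main2)
    args = parser.parse_args(argv)
    args.func(args)

main(["main1"])  # equivalent to: python main.py main1
```

Each new mainX then becomes one `add_parser` line plus a function, instead of a new top-level file.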

how to add non .py files into python egg

I have a flask app which looks like
my-app
│   └── src
│   └── python
│ └── config
│   └── app
│── MANIFEST.in
└── setup.py
The config folder is full of *.yaml files, I want to add all the static config files into my python egg after using
python setup.py install
My setup.py looks like
import os
from setuptools import setup, find_packages
path = os.path.dirname(os.path.abspath(__file__))
setup(
name="app",
version="1.0.0",
author="Anna",
description="",
keywords=[],
packages=find_packages(path + '/src/python'),
package_dir={'': path + '/src/python'},
include_package_data=True
)
I am trying to use the MANIFEST.in file to add the config files.
However, it always gives this error:
error: Error: setup script specifies an absolute path:
/Users/Anna/Desktop/my-app/src/python/app
setup() arguments must *always* be /-separated paths relative to the
setup.py directory, *never* absolute paths.
I have not used any absolute paths in my code. I've seen other posts trying to bypass this error by removing
include_package_data=True
However, in my case, if I do this to avoid the error, none of my YAML files are added.
I was wondering if there are ways to fix this problem. Thanks

Confused about the package_dir and packages settings in setup.py

Here is my project directory structure, which includes the project folder, plus
a "framework" folder containing packages and modules shared amongst several projects
which resides at the same level in the hierarchy as the project folders:
Framework/
package1/
__init__.py
mod1.py
mod2.py
package2/
__init__.py
moda.py
modb.py
My_Project/
src/
main_package/
__init__.py
main_module.py
setup.py
README.txt
Here is a partial listing of the contents of my setup.py file:
from distutils.core import setup
setup(packages=[
'package1',
'package2.moda',
'main_package'
],
package_dir={
'package1': '../Framework/package1',
'package2.moda': '../Framework/package2',
'main_package': 'src/main_package'
})
Here are the issues:
No dist or build directories are created
Manifest file is created, but all modules in package2 are listed, not just the moda.py module
The build terminates with an error:
README.txt: Incorrect function
I don't know if I have a single issue (possibly related to my directory structure) or if I have multiple issues but I've read everything I can find on distribution of Python applications, and I'm stumped.
If I understand correctly, the paths in package_dir should stop at the parent directory of the directories which are Python packages. In other words, try this:
package_dir={'package1': '../Framework',
'package2': '../Framework',
'main_package': 'src'})
I've had a similar problem, which was solved through the specification of the root folder and of the packages inside that root.
My package has the following structure:
.
├── LICENSE
├── README.md
├── setup.py
└── src
└── common
├── __init__.py
├── persistence.py
├── schemas.py
└── utils.py
The setup.py contains the package_dir and packages line:
package_dir={"myutils": "src"},
packages=['myutils.common'],
After running the python setup.py bdist_wheel and installing the .whl file, the package can be called using:
import myutils.common

How to read a (static) file from inside a Python package?

Could you tell me how can I read a file that is inside my Python package?
My situation
A package that I load has a number of templates (text files used as strings) that I want to load from within the program. But how do I specify the path to such file?
Imagine I want to read a file from:
package\templates\temp_file
Some kind of path manipulation? Package base path tracking?
TLDR; Use standard-library's importlib.resources module as explained in the method no 2, below.
The traditional pkg_resources from setuptools is not recommended anymore because the new method:
it is significantly more performant;
is is safer since the use of packages (instead of path-stings) raises compile-time errors;
it is more intuitive because you don't have to "join" paths;
it is faster when developing since you don't need an extra dependency (setuptools), but rely on Python's standard-library alone.
I kept the traditional listed first, to explain the differences with the new method when porting existing code (porting also explained here).
Let's assume your templates are located in a folder nested inside your module's package:
<your-package>
+--<module-asking-the-file>
+--templates/
+--temp_file <-- We want this file.
Note 1: For sure, we should NOT fiddle with the __file__ attribute (e.g. code will break when served from a zip).
Note 2: If you are building this package, remember to declatre your data files as package_data or data_files in your setup.py.
1) Using pkg_resources from setuptools(slow)
You may use pkg_resources package from setuptools distribution, but that comes with a cost, performance-wise:
import pkg_resources
# Could be any dot-separated package/module name or a "Requirement"
resource_package = __name__
resource_path = '/'.join(('templates', 'temp_file')) # Do not use os.path.join()
template = pkg_resources.resource_string(resource_package, resource_path)
# or for a file-like stream:
template = pkg_resources.resource_stream(resource_package, resource_path)
Tips:
This will read data even if your distribution is zipped, so you may set zip_safe=True in your setup.py, and/or use the long-awaited zipapp packer from python-3.5 to create self-contained distributions.
Remember to add setuptools into your run-time requirements (e.g. in install_requires`).
... and notice that according to the Setuptools/pkg_resources docs, you should not use os.path.join:
Basic Resource Access
Note that resource names must be /-separated paths and cannot be absolute (i.e. no leading /) or contain relative names like "..". Do not use os.path routines to manipulate resource paths, as they are not filesystem paths.
2) Python >= 3.7, or using the backported importlib_resources library
Use the standard library's importlib.resources module which is more efficient than setuptools, above:
try:
import importlib.resources as pkg_resources
except ImportError:
# Try backported to PY<37 `importlib_resources`.
import importlib_resources as pkg_resources
from . import templates # relative-import the *package* containing the templates
template = pkg_resources.read_text(templates, 'temp_file')
# or for a file-like stream:
template = pkg_resources.open_text(templates, 'temp_file')
Attention:
Regarding the function read_text(package, resource):
The package can be either a string or a module.
The resource is NOT a path anymore, but just the filename of the resource to open, within an existing package; it may not contain path separators and it may not have sub-resources (i.e. it cannot be a directory).
For the example asked in the question, we must now:
make the <your_package>/templates/ into a proper package, by creating an empty __init__.py file in it,
so now we can use a simple (possibly relative) import statement (no more parsing package/module names),
and simply ask for resource_name = "temp_file" (no path).
Tips:
To access a file inside the current module, set the package argument to __package__, e.g. pkg_resources.read_text(__package__, 'temp_file') (thanks to #ben-mares).
Things become interesting when an actual filename is asked with path(), since now context-managers are used for temporarily-created files (read this).
Add the backported library, conditionally for older Pythons, with install_requires=[" importlib_resources ; python_version<'3.7'"] (check this if you package your project with setuptools<36.2.1).
Remember to remove setuptools library from your runtime-requirements, if you migrated from the traditional method.
Remember to customize setup.py or MANIFEST to include any static files.
You may also set zip_safe=True in your setup.py.
A packaging prelude:
Before you can even worry about reading resource files, the first step is to make sure that the data files are getting packaged into your distribution in the first place - it is easy to read them directly from the source tree, but the important part is making sure these resource files are accessible from code within an installed package.
Structure your project like this, putting data files into a subdirectory within the package:
.
├── package
│   ├── __init__.py
│   ├── templates
│   │   └── temp_file
│   ├── mymodule1.py
│   └── mymodule2.py
├── README.rst
├── MANIFEST.in
└── setup.py
You should pass include_package_data=True in the setup() call. The manifest file is only needed if you want to use setuptools/distutils and build source distributions. To make sure the templates/temp_file gets packaged for this example project structure, add a line like this into the manifest file:
recursive-include package *
Historical cruft note: Using a manifest file is not needed for modern build backends such as flit, poetry, which will include the package data files by default. So, if you're using pyproject.toml and you don't have a setup.py file then you can ignore all the stuff about MANIFEST.in.
Now, with packaging out of the way, onto the reading part...
Recommendation:
Use standard library pkgutil APIs. It's going to look like this in library code:
# within package/mymodule1.py, for example
import pkgutil
data = pkgutil.get_data(__name__, "templates/temp_file")
It works in zips. It works on Python 2 and Python 3. It doesn't require third-party dependencies. I'm not really aware of any downsides (if you are, then please comment on the answer).
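To see pkgutil.get_data in action without installing anything, the sketch below fabricates a small package (demo_pkg2 is a made-up name) with a templates/ subdirectory and reads the file back as bytes:

```python
import os
import pkgutil
import sys
import tempfile

# Build a throwaway package with a data subdirectory.
root = tempfile.mkdtemp()
tpl = os.path.join(root, "demo_pkg2", "templates")
os.makedirs(tpl)
open(os.path.join(root, "demo_pkg2", "__init__.py"), "w").close()
with open(os.path.join(tpl, "temp_file"), "wb") as f:
    f.write(b"hello")
sys.path.insert(0, root)

# get_data takes the package name and a slash-separated relative path,
# and returns the raw bytes of the resource.
data = pkgutil.get_data("demo_pkg2", "templates/temp_file")
print(data)  # b'hello'
```

Note that the subdirectory needs no __init__.py of its own; only the package containing it does.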
Bad ways to avoid:
Bad way #1: using relative paths from a source file
This is currently the accepted answer. At best, it looks something like this:
from pathlib import Path
resource_path = Path(__file__).parent / "templates"
data = resource_path.joinpath("temp_file").read_bytes()
What's wrong with that? The assumption that you have files and subdirectories available is not correct. This approach doesn't work if executing code which is packed in a zip or a wheel, and it may be entirely out of the user's control whether or not your package gets extracted to a filesystem at all.
Bad way #2: using pkg_resources APIs
This is described in the top-voted answer. It looks something like this:
from pkg_resources import resource_string
data = resource_string(__name__, "templates/temp_file")
What's wrong with that? It adds a runtime dependency on setuptools, which should preferably be an install time dependency only. Importing and using pkg_resources can become really slow, as the code builds up a working set of all installed packages, even though you were only interested in your own package resources. That's not a big deal at install time (since installation is once-off), but it's ugly at runtime.
Bad way #3: using legacy importlib.resources APIs
This is currently the recommendation in the top-voted answer. It's in the standard library since Python 3.7. It looks like this:
from importlib.resources import read_binary
data = read_binary("package.templates", "temp_file")
What's wrong with that? Well, unfortunately, the implementation left some things to be desired, and it was deprecated in Python 3.11. Using importlib.resources.read_binary, importlib.resources.read_text and friends will require you to add an empty file templates/__init__.py so that data files reside within a sub-package rather than in a subdirectory. It will also expose the package/templates subdirectory as an importable package.templates sub-package in its own right. This won't work with many existing packages which are already published using resource subdirectories instead of resource sub-packages, and it's inconvenient to add the __init__.py files everywhere, muddying the boundary between data and code.
This approach was deprecated in upstream importlib_resources in 2021, and was deprecated in the stdlib from Python 3.11. bpo-45514 tracked the deprecation, and the migrating-from-legacy guide offers _legacy.py wrappers to aid with the transition.
Honorable mention: using newer importlib_resources APIs
This has not been mentioned in any other answers yet, but importlib_resources is more than a simple backport of the Python 3.7+ importlib.resources code. It has traversable APIs which you can use like this:
import importlib_resources
my_resources = importlib_resources.files("package")
data = (my_resources / "templates" / "temp_file").read_bytes()
This works on Python 2 and 3, it works in zips, and it doesn't require spurious __init__.py files to be added in resource subdirectories. The only downside vs pkgutil that I can see is that these new APIs are only available in the stdlib for Python-3.9+, so there is still a third-party dependency needed to support older Python versions. If you only need to run on Python-3.9+ then use this approach, or you can add a compatibility layer and a conditional dependency on the backport for older Python versions:
# in your library code:
try:
    from importlib.resources import files
except ImportError:
    from importlib_resources import files
# in your setup.py or similar:
from setuptools import setup

setup(
    ...
    install_requires=[
        'importlib_resources; python_version < "3.9"',
    ]
)
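The traversable files() API can be demonstrated the same way as the other approaches; this sketch uses a fabricated package (demo_pkg3 is a made-up name) and the stdlib importlib.resources.files available from Python 3.9:

```python
import os
import sys
import tempfile

from importlib.resources import files  # stdlib since Python 3.9

# Build a throwaway package with a data subdirectory.
root = tempfile.mkdtemp()
tpl = os.path.join(root, "demo_pkg3", "templates")
os.makedirs(tpl)
open(os.path.join(root, "demo_pkg3", "__init__.py"), "w").close()
with open(os.path.join(tpl, "temp_file"), "wb") as f:
    f.write(b"hello")
sys.path.insert(0, root)

# files() returns a Traversable; navigate with "/" and read directly.
# No __init__.py is needed in the templates/ subdirectory.
data = (files("demo_pkg3") / "templates" / "temp_file").read_bytes()
print(data)  # b'hello'
```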
Example project:
I've created an example project on github and uploaded on PyPI, which demonstrates all five approaches discussed above. Try it out with:
$ pip install resources-example
$ resources-example
See https://github.com/wimglenn/resources-example for more info.
The section "10.8. Reading Datafiles Within a Package" of Python Cookbook, Third Edition by David Beazley and Brian K. Jones gives the answer.
I'll just quote it here:
Suppose you have a package with files organized as follows:
mypackage/
    __init__.py
    somedata.dat
    spam.py
Now suppose the file spam.py wants to read the contents of the file somedata.dat. To do
it, use the following code:
import pkgutil
data = pkgutil.get_data(__package__, 'somedata.dat')
The resulting variable data will be a byte string containing the raw contents of the file.
The first argument to get_data() is a string containing the package name. You can
either supply it directly or use a special variable, such as __package__. The second
argument is the relative name of the file within the package. If necessary, you can navigate
into different directories using standard Unix filename conventions as long as the
final directory is still located within the package.
In this way, the package can be installed as a directory, a .zip, or an .egg.
In case you have this structure
lidtk
├── bin
│   └── lidtk
├── lidtk
│   ├── analysis
│   │   ├── char_distribution.py
│   │   └── create_cm.py
│   ├── classifiers
│   │   ├── char_dist_metric_train_test.py
│   │   ├── char_features.py
│   │   ├── cld2
│   │   │   ├── cld2_preds.txt
│   │   │   └── cld2wili.py
│   │   ├── get_cld2.py
│   │   ├── text_cat
│   │   │   ├── __init__.py
│   │   │   ├── README.md <---------- say you want to get this
│   │   │   └── textcat_ngram.py
│   │   └── tfidf_features.py
│   ├── data
│   │   ├── __init__.py
│   │   ├── create_ml_dataset.py
│   │   ├── download_documents.py
│   │   ├── language_utils.py
│   │   ├── pickle_to_txt.py
│   │   └── wili.py
│   ├── __init__.py
│   ├── get_predictions.py
│   ├── languages.csv
│   └── utils.py
├── README.md
├── setup.cfg
└── setup.py
you need this code:
import pkg_resources
# __name__ in case you're within the package
# - otherwise it would be 'lidtk' in this example as it is the package name
path = 'classifiers/text_cat/README.md' # always use slash
filepath = pkg_resources.resource_filename(__name__, path)
The strange "always use slash" part comes from the setuptools documentation:
Also notice that if you use paths, you must use a forward slash (/) as the path separator, even if you are on Windows. Setuptools automatically converts slashes to appropriate platform-specific separators at build time
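A runnable version of the snippet above, using a fabricated package (the name lidtk_demo and the file contents are made up for illustration) instead of the real lidtk project:

```python
import os
import sys
import tempfile

import pkg_resources  # requires setuptools to be installed

# Fabricate a package mirroring the relevant part of the lidtk layout.
root = tempfile.mkdtemp()
sub = os.path.join(root, "lidtk_demo", "classifiers", "text_cat")
os.makedirs(sub)
open(os.path.join(root, "lidtk_demo", "__init__.py"), "w").close()
with open(os.path.join(sub, "README.md"), "w") as f:
    f.write("# readme")
sys.path.insert(0, root)

# Always use a forward slash in the resource path, even on Windows.
path = pkg_resources.resource_filename("lidtk_demo",
                                       "classifiers/text_cat/README.md")
with open(path) as f:
    text = f.read()
print(text)  # # readme
```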
In case you wonder where the documentation is:
PEP 0365
https://packaging.python.org/guides/single-sourcing-package-version/
The accepted answer should be to use importlib.resources. pkgutil.get_data also requires that its package argument be a regular (non-namespace) package (see the pkgutil docs). Hence, the directory containing the resource must have an __init__.py file, giving it exactly the same limitations as importlib.resources. If the overhead of pkg_resources is not a concern, it is also an acceptable alternative.
Pre-Python-3.3, all packages were required to have an __init__.py. Post-Python-3.3, a folder doesn't need an __init__.py to be a package. This is called a namespace package. Unfortunately, pkgutil does not work with namespace packages (see pkgutil docs).
For example, with the package structure:
+-- foo/
| +-- __init__.py
| +-- bar/
| | +-- hi.txt
where hi.txt just has Hi!, you get the following
>>> import pkgutil
>>> rsrc = pkgutil.get_data("foo.bar", "hi.txt")
>>> print(rsrc)
None
However, with an __init__.py in bar, you get
>>> import pkgutil
>>> rsrc = pkgutil.get_data("foo.bar", "hi.txt")
>>> print(rsrc)
b'Hi!'
Assuming you are using an egg file that is not extracted:
I "solved" this in a recent project by using a post-install script that extracts my templates from the egg (a zip file) to the proper directory in the filesystem. It was the quickest, most reliable solution I found, since working with __path__[0] can go wrong sometimes (I don't recall the name, but I came across at least one library that added something in front of that list!).
Also egg files are usually extracted on the fly to a temporary location called the "egg cache". You can change that location using an environment variable, either before starting your script or even later, eg.
os.environ['PYTHON_EGG_CACHE'] = path
However there is pkg_resources that might do the job properly.
