Way to access resource files in Python

What is the proper way to access resources in Python programs?
Basically, in many of my Python modules I end up writing code like this:
DIRNAME = os.path.split(__file__)[0]
(...)
template_file = os.path.join(DIRNAME, "template.foo")
Which is OK but:
It will break if I start to use Python zip packages.
It is boilerplate code.
In Java I had a function that did exactly the same --- but it worked both when the code was lying in a bunch of folders and when it was packaged in a .jar file.
Is there such a function in Python, or is there any other pattern that I might use?

You'll want to look at using either get_data in the stdlib or pkg_resources from setuptools/distribute. Which one you use probably depends on whether you're already using distribute to package your code as an egg.
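For the stdlib route, here's a minimal sketch using pkgutil.get_data (assuming, as in the question, a resource template.foo shipped inside a hypothetical package your.package):
import pkgutil

# Returns the resource as bytes (or None if it can't be found); works for
# packages imported from the filesystem and from zip archives/eggs alike.
data = pkgutil.get_data("your.package", "template.foo")
if data is not None:
    template_text = data.decode("utf-8")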

Since Python 3.7, the proper way to access a file in package resources is to use the importlib.resources module.
One can, for example, use the path function to access a particular file in a Python package:
import importlib.resources

with importlib.resources.path("your.package.templates", "template.foo") as template_file:
    ...
Starting with Python 3.9, this module introduced the files() API, which is to be preferred over the legacy API.
One can use the files() function to access a particular file in a Python package:
template_res = importlib.resources.files("your.package.templates").joinpath("template.foo")
with importlib.resources.as_file(template_res) as template_file:
    ...
For older versions, I recommend installing and using the importlib-resources backport. Its documentation also explains in detail how to migrate an old implementation based on pkg_resources to importlib-resources.
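A sketch of falling back to the backport, under the same hypothetical package layout as above; the backport mirrors the stdlib API, so only the import changes:
try:
    from importlib.resources import files, as_file  # Python >= 3.9
except ImportError:
    from importlib_resources import files, as_file  # backport from PyPI

template_res = files("your.package.templates").joinpath("template.foo")
with as_file(template_res) as template_file:
    print(template_file.read_text())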

Trying to understand how we could combine the two aspects together:
Loading resources from the native filesystem
Loading resources packaged in zipped files
Reading through the quick tutorial on zipimport : http://www.doughellmann.com/PyMOTW/zipimport/
I see the following example:
import sys
sys.path.insert(0, 'zipimport_example.zip')
import os
import zipimport

importer = zipimport.zipimporter('zipimport_example.zip')
module = importer.load_module('example_package')
print(module.__file__)
print(module.__loader__.get_data('example_package/README.txt'))
I think the output of __file__ would be "zipimport_example.zip/example_package/__init__.pyc".
I need to check how it looks from inside.
But then we could always do something like this:
if ".zip" in example_package.__file__:
    # load using get_data
    ...
else:
    # load by building the correct file path
    ...
[Edit:] I have tried to work out the example a bit better.
If the package gets imported as a zipped file, then two things happen:
__file__ contains ".zip" in its path.
__loader__ is available in the namespace.
If these two conditions are met, then within the package you could do:
print(__loader__.get_data(os.path.join('package_name', 'README.txt')))
Otherwise the module was loaded normally, and you can follow the regular approach to loading the file.
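For illustration, the two branches can be folded into one helper; a rough sketch (the imported package object is passed in, and name is the resource filename, e.g. 'README.txt'):
import os

def load_resource(package, name):
    # Resolve the resource next to the package's __file__ in both cases.
    path = os.path.join(os.path.dirname(package.__file__), name)
    loader = getattr(package, '__loader__', None)
    if '.zip' in package.__file__ and loader is not None:
        # zip-imported: zipimporter.get_data() accepts the archive-prefixed
        # path and returns the file contents as bytes.
        return loader.get_data(path)
    # normal import: plain filesystem read
    with open(path, 'rb') as f:
        return f.read()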

I guess the zipimport standard python module could be an answer...
EDIT: well, not the use of the module directly, but using sys.path as shown in the example could be a good way:
I have a zip file test.zip with one Python module test and a file test.foo inside.
To check that the zipped Python module test can be aware of test.foo, it contains this code:
import os

DIRNAME = os.path.dirname(__file__)
if os.path.exists(os.path.join(DIRNAME, 'test.foo')):
    print('OK')
else:
    print('KO')
Test looks ok:
>>> import sys
>>> sys.path.insert(0, r'D:\DATA\FP12210\My Documents\Outils\SVN\05_impl\2_tools\test.zip')
>>> import test
OK
>>>
So a solution could be to loop over your zip file to retrieve all Python modules and add them to sys.path; ideally, this piece of code would be the first thing loaded by your application.
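A minimal sketch of that startup step, assuming the archives sit in a hypothetical plugins/ directory:
import sys
import glob

# Prepend every zip archive found in plugins/ so the modules inside
# become importable before anything else runs.
for archive in glob.glob('plugins/*.zip'):
    if archive not in sys.path:
        sys.path.insert(0, archive)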

Related

Is this the approved way to access data adjacent to/packaged with a Python script?

I have a Python script that needs some data that's stored in a file that will always be in the same location as the script. I have a setup.py for the script, and I want to make sure it's pip installable in a wide variety of environments, and can be turned into a standalone executable if necessary.
Currently the script runs with Python 2.7 and Python 3.3 or higher (though I don't have a test environment for 3.3 so I can't be sure about that).
I came up with this method to get the data. This script isn't part of a module directory with __init__.py or anything; it's just a standalone file that works when run with python directly, but it also has an entry point defined in the setup.py file. It's all one file. Is this the correct way?
def fetch_wordlist():
    wordlist = 'wordlist.txt'
    try:
        import importlib.resources as res
        return res.read_binary(__file__, wordlist)
    except ImportError:
        pass
    try:
        import pkg_resources as resources
        req = resources.Requirement.parse('makepw')
        wordlist = resources.resource_filename(req, wordlist)
    except ImportError:
        import os.path
        wordlist = os.path.join(os.path.dirname(__file__), wordlist)
    with open(wordlist, 'rb') as f:
        return f.read()
This seems ridiculously complex. Also, it seems to rely on the package management system in ways I'm uncomfortable with. The script no longer works unless it's been pip-installed, and that also doesn't seem desirable.
Resources living on the filesystem
The standard way to read a file adjacent to your python script would be:
a) If you've got python>=3.4 I'd suggest you use the pathlib module, like this:
from pathlib import Path

def fetch_wordlist(filename="wordlist.txt"):
    return (Path(__file__).parent / filename).read_text()

if __name__ == '__main__':
    print(fetch_wordlist())
b) And if you're still on a Python version < 3.4, or you still want to use the good old os.path module, you should do something like this:
import os

def fetch_wordlist(filename="wordlist.txt"):
    with open(os.path.join(os.path.dirname(__file__), filename)) as f:
        return f.read()

if __name__ == '__main__':
    print(fetch_wordlist())
Also, I'd suggest you catch exceptions in the outer callers. The above methods are the standard way to read files in Python, so you don't need to wrap them in a function like fetch_wordlist; put differently, reading files in Python is an "atomic" operation.
Now, it may happen that you've frozen your program using some freezer such as cx_freeze, pyinstaller, or similar... in that case you'd need to detect that; here's a simple way to check:
a) using os.path:
if getattr(sys, 'frozen', False):
    app_path = os.path.dirname(sys.executable)
elif __file__:
    app_path = os.path.dirname(__file__)
b) using pathlib:
if getattr(sys, 'frozen', False):
    app_path = Path(sys.executable).parent
elif __file__:
    app_path = Path(__file__).parent
Resources living inside a zip file
The above solutions work while the code lives on the file system, but they won't work if the package lives inside a zip file. When that happens, you could use either importlib.resources (new in version 3.7) or the pkg_resources combo you've already shown in the question (or wrap them up in some helpers), or you could use a nice 3rd-party library called importlib_resources that should work with both old and modern Python versions:
pypi: https://pypi.org/project/importlib_resources/
documentation: https://importlib-resources.readthedocs.io/en/latest/
Specifically for your particular problem, I'd suggest you take a look at this: https://importlib-resources.readthedocs.io/en/latest/using.html#file-system-or-zip-file.
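For the wordlist case specifically, a hedged sketch; it assumes the script is importable as a package (the hypothetical name makepw below), since both APIs resolve resources relative to a package rather than a bare file:
try:
    import importlib.resources as res  # Python >= 3.7
except ImportError:
    import importlib_resources as res  # 3rd-party backport

def fetch_wordlist():
    # Finds the resource next to the package whether it lives on disk
    # or inside a zip/wheel.
    return res.read_binary("makepw", "wordlist.txt")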
If you want to know what that library is doing behind the curtains because you're not willing to install any 3rd-party library, you can find the code for py2 here and py3 here, in case you want to pull out the relevant bits for your particular problem.
I'm going to go out on a limb and make an assumption because it may drastically simplify your problem. The only way I can imagine that you can claim that this data is "stored in a file that will always be in the same location as the script" is because you created this data, once, and put it in a file in the source code directory. Even though this data is binary, have you considered making the data a literal byte-string in a python file, and then simply importing it as you would anything else?
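A minimal sketch of that idea, with wordlist_data.py as a hypothetical file generated once from wordlist.txt:
# wordlist_data.py -- generated once and committed alongside the script:
WORDLIST = b"aardvark\nabacus\nabdomen\n"  # ...the real data goes here

# the main script then needs no filesystem access at runtime:
from wordlist_data import WORDLIST

def fetch_wordlist():
    return WORDLIST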
You're right that your method of reading a file is unnecessarily complex. Unless you have a really specific reason to use the importlib and pkg_resources modules, it's rather simple.
import os

def fetch_wordlist():
    if not os.path.exists('wordlist.txt'):
        raise FileNotFoundError
    with open('wordlist.txt', 'rb') as wordlist:
        return wordlist.read()
You haven't given much information regarding your script, so I cannot comment on why it doesn't work unless it's installed using pip. My best guess: your script is probably packed into a python package.

How to pass entire directory into python via command line and sys.argv

So in the past, when I've used a Unix server for my Python development, if I wanted to pass in an entire folder or directory I would just put an asterisk (*) on the end of it. An example would be something like users/wilkenia/shakespeare/* to pass in a set of files containing each of Shakespeare's plays. Is there a way to do this in Windows? I've tried putting in C:\Users\Alexander\Desktop\coding-data-exam\wx_data* and the same with the disk name removed. Nothing has worked so far; in fact, it takes in the directory as an argument itself.
Edit: implemented glob, getting a permissions error, even though I'm running as administrator. Here's my code if anyone wants to have a look.
For the sake of showing how you can use pathlib to achieve this result, you can do something like this:
some_script.py:
import sys
from pathlib import Path

path = Path(sys.argv[1])
glob_path = path.glob('*')
for file_path in glob_path:
    print(file_path)
Demo:
python some_script.py C:/some/path/
Output:
C:/some/path/depth_1.txt
C:/some/path/dude.docx
C:/some/path/dude.py
C:/some/path/dude_bock.txt
The nice thing about pathlib is that it takes an object-oriented approach that makes working with the filesystem easier.
Note: pathlib is available out-of-the-box from Python 3.4 and above. If you are using an older version of Python, you will need to use the backported package that you can get from pypi: here
Simply: pip install pathlib2
You can use the glob module, it does exactly this.
A quick demo:
In [81]: import glob
In [82]: glob.glob('*')
Out[82]:
[
... # a bunch of my personal files from my cwd
]
If you want to extend this for your use case, you'll need to do something along the lines of:
import sys
import glob

arg = sys.argv[1]
for file in glob.glob(arg):
    ...
You'll read your args with sys.argv and pass them on to glob.
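One reason the Unix habit doesn't carry over: on Unix the shell expands * before Python ever starts, while Windows cmd passes the pattern through verbatim, so the script has to expand it itself. Quoting the argument also keeps Unix shells from expanding it prematurely:
python some_script.py "C:/some/path/*"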

Making Python guess a file Name

I have the following function:
unpack_binaryfunction('third-party/jdk-6u29-linux-i586.bin' , ('/home/user/%s/third-party' % installdir), 'jdk1.6.0_29')
Which uses os.system to execute a Java deployment. The line, combined with the function (which is unimportant; it just calls some Linux commands), works perfectly.
However, this only works if that specific version of the JDK is in the 'third-party' folder.
Therefore I need code that will look at the files in the 'third-party' folder, find one that starts with 'jdk', and fill out the rest of the filename itself.
I am absolutely stuck. Are there any functions or libraries that can help with file searching etc.?
To clarify: I need the code not to hard-code the entire jdk-6u29-linux-i586.bin, but to use whatever jdk-xxxx... is actually in the third-party folder.
This can easily be done using the glob module and a bit of string parsing to extract the version.
import glob
import os.path

for path in glob.glob('third-party/jdk-*'):
    parent, name = os.path.split(path)  # "third-party", "jdk-6u29-linux-i586.bin"
    version, update = name.split('-')[1].split('u')  # ("6", "29")
    unpack_binaryfunction(path, '/home/user/%s/third-party' % installdir, 'jdk1.{}.0_{}'.format(version, update))

How to get a list of all the Python standard library modules?

I want something like sys.builtin_module_names except for the standard library. Other things that didn't work:
sys.modules - only shows modules that have already been loaded
sys.prefix - a path that would include non-standard library modules and doesn't seem to work inside a virtualenv.
The reason I want this list is so that I can pass it to the --ignore-module or --ignore-dir command line options of trace.
So ultimately, I want to know how to ignore all the standard library modules when using trace or sys.settrace.
I brute-forced it by writing some code to scrape the TOC of the Standard Library page in the official Python docs. I also built a simple API for getting a list of standard libraries (for Python versions 2.6, 2.7, 3.2, 3.3, and 3.4).
The package is here, and its usage is fairly simple:
>>> from stdlib_list import stdlib_list
>>> libraries = stdlib_list("2.7")
>>> libraries[:10]
['AL', 'BaseHTTPServer', 'Bastion', 'CGIHTTPServer', 'ColorPicker', 'ConfigParser', 'Cookie', 'DEVICE', 'DocXMLRPCServer', 'EasyDialogs']
Why not work out what's part of the standard library yourself?
import distutils.sysconfig as sysconfig
import os

std_lib = sysconfig.get_python_lib(standard_lib=True)
for top, dirs, files in os.walk(std_lib):
    for nm in files:
        if nm != '__init__.py' and nm[-3:] == '.py':
            print(os.path.join(top, nm)[len(std_lib)+1:-3].replace(os.sep, '.'))
gives
abc
aifc
antigravity
--- a bunch of other files ----
xml.parsers.expat
xml.sax.expatreader
xml.sax.handler
xml.sax.saxutils
xml.sax.xmlreader
xml.sax._exceptions
Edit: You'll probably want to add a check to avoid site-packages if you need to avoid non-standard library modules.
Python >= 3.10:
sys.stdlib_module_names
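This is a frozenset of top-level module names, so membership checks are cheap; for example:
import sys

print("json" in sys.stdlib_module_names)      # True
print("requests" in sys.stdlib_module_names)  # False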
Python < 3.10:
The author of isort, a tool which cleans up imports, had to grapple with this same problem in order to satisfy the PEP 8 requirement that core library imports be ordered before third-party imports.
I have been using this tool and it seems to be working well. You can use the method place_module in the file isort.py:
>>> from isort import place_module
>>> place_module("json")
'STDLIB'
>>> place_module("requests")
'THIRDPARTY'
Or you can get a set of module names directly, depending on the Python version, for example:
>>> from isort.stdlibs.py39 import stdlib
>>> for name in sorted(stdlib): print(name)
... <200+ lines>
xml
xmlrpc
zipapp
zipfile
zipimport
zlib
zoneinfo
Take a look at this,
https://docs.python.org/3/py-modindex.html
They made an index page for the standard modules.
On Python 3.10 there is now sys.stdlib_module_names.
Here's an improvement on Caspar's answer, which is not cross-platform and misses top-level modules (e.g. email), dynamically loaded modules (e.g. array), and core built-in modules (e.g. sys):
import distutils.sysconfig as sysconfig
import os
import sys

std_lib = sysconfig.get_python_lib(standard_lib=True)
for top, dirs, files in os.walk(std_lib):
    for nm in files:
        prefix = top[len(std_lib)+1:]
        if prefix[:13] == 'site-packages':
            continue
        if nm == '__init__.py':
            print(top[len(std_lib)+1:].replace(os.path.sep, '.'))
        elif nm[-3:] == '.py':
            print(os.path.join(prefix, nm)[:-3].replace(os.path.sep, '.'))
        elif nm[-3:] == '.so' and top[-11:] == 'lib-dynload':
            print(nm[0:-3])
for builtin in sys.builtin_module_names:
    print(builtin)
This is still not perfect because it will miss things like os.path which is defined from within os.py in a platform-dependent manner via code such as import posixpath as path, but it's probably as good as you'll get, bearing in mind that Python is a dynamic language and you can't ever really know which modules are defined until they're actually defined at runtime.
This will get you close:
import sys; import glob
glob.glob(sys.prefix + "/lib/python%d.%d" % (sys.version_info[0:2]) + "/*.py")
Another possibility for the ignore-dir option:
os.pathsep.join(sys.path)
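A hedged sketch of wiring that into the question's trace use case, here via the trace module's programmatic API (Trace accepts an ignoredirs sequence, the counterpart of --ignore-dir):
import sys
import trace

# Skip everything importable from sys.path entries (stdlib included).
tracer = trace.Trace(trace=1, count=0,
                     ignoredirs=[p for p in sys.path if p])
tracer.run('print("hello")')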
This isn't perfect, but should get you pretty close if you can't run 3.10:
import os
import distutils.sysconfig

def get_stdlib_module_names():
    stdlib_dir = distutils.sysconfig.get_python_lib(standard_lib=True)
    return {f.replace(".py", "") for f in os.listdir(stdlib_dir)}
This misses some modules such as sys, math, time, and itertools.
My use case is logging which modules were imported during an app run, so having a rough filter for stdlib modules is fine. Also I return it as a set rather than a list so membership checks are faster.
Building on #Edmund's answer, this solution pulls the list from the official website:
def standard_libs(version=None, top_level_only=True):
    import re
    from urllib.request import urlopen
    if version is None:
        import sys
        version = sys.version_info
        version = f"{version.major}.{version.minor}"
    url = f"https://docs.python.org/{version}/py-modindex.html"
    with urlopen(url) as f:
        page = f.read()
    modules = set()
    for module in re.findall(r'#module-(.*?)[\'"]',
                             page.decode('ascii', 'replace')):
        if top_level_only:
            module = module.split(".")[0]
        modules.add(module)
    return modules
It returns a set. For example, here are the modules that were added between 3.5 and 3.10:
>>> standard_libs("3.10") - standard_libs("3.5")
{'contextvars', 'dataclasses', 'graphlib', 'secrets', 'zoneinfo'}
Since this is based on the official documentation, it doesn't include undocumented modules, such as:
Easter eggs, namely this and antigravity
Internal modules, such as genericpath, posixpath or ntpath, which are not supposed to be used directly (you should use os.path instead). Other internal modules: idlelib (which implements the IDLE editor), opcode, sre_constants, sre_compile, sre_parse, pyexpat, pydoc_data, nt.
All modules with a name starting with an underscore (which are also internal), except for __main__, _thread, and __future__, which are public and documented.
If you're concerned that the website may be down, you can just cache the list locally. For example, you can use the following function to create a small Python module containing all the module names:
def create_stdlib_module_names(
        module_name="stdlib_module_names",
        variable="stdlibs",
        version=None,
        top_level_only=True):
    stdlibs = standard_libs(
        version=version, top_level_only=top_level_only)
    with open(f"{module_name}.py", "w") as f:
        f.write(f"{variable} = {stdlibs!r}\n")
Here's how to use it:
>>> create_stdlib_module_names() # run this just once
>>> from stdlib_module_names import stdlibs
>>> len(stdlibs)
207
>>> "collections" in stdlibs
True
>>> "numpy" in stdlibs
False
This works on Anaconda on Windows, and I suspect it will work on Linux distros.
It goes to your Anaconda directory, e.g.:
C:\Users\{user}\anaconda3\Lib, where standard libraries are installed. It then pulls folder names and filenames (dropping extensions).
import sys
import os

standard_libs = []
standard_lib_path = os.path.join(sys.prefix, "Lib")
for file in os.listdir(standard_lib_path):
    standard_libs.append(file.split(".py")[0].strip().lower())
NB: Builtins, viewable via print(dir(__builtins__)), are automatically loaded, whereas standard libs are not.

Python - Importing a global/site-packages module rather than the file of the same name in the local directory

I'm using python and virtualenv/pip. I have a module installed via pip called test_utils (it's django-test-utils). Inside one of my django apps, I want to import that module. However I also have another file test_utils.py in the same directory. If I go import test_utils, then it will import this local file.
Is it possible to make python use a non-local / non-relative / global import? I suppose I can just rename my test_utils.py, but I'm curious.
You can switch the search order by changing sys.path:
import sys

del sys.path[0]
sys.path.append('')
This puts the current directory after the system search path, so local files won't shadow standard modules.
My problem was even more elaborate:
importing a global/site-packages module from a file with the same name
Working on aero the pm recycler I wanted access to the pip api, in particular pip.commands.search.SearchCommand from my adapter class Pip in source file pip.py.
In this case, trying to modify sys.path is useless. I even went as far as wiping sys.path completely and adding the folder .../site-packages/pip...egg/ as the only item in sys.path, and no luck.
I would still get:
print(pip.__package__)
# 'aero.adapters'
I found two options that did eventually work for me, they should work equally well for you:
using __builtin__.__import__(), the built-in function:
global_pip = __import__('pip.commands.search', {}, {}, ['SearchCommand'], -1)
SearchCommand = global_pip.SearchCommand
Reading the documentation though, suggests using the following method instead.
using importlib.import_module(), the __import__() convenience wrapper.
The documentation explains that import_module() is a minor subset of functionality from Python 3.1, provided to help ease the transition from 2.7 to 3.1:
from importlib import import_module
SearchCommand = import_module('pip.commands.search').SearchCommand
Both options get the job done while import_module() definitely feels more Pythonic if you ask me, would you agree?
nJoy!
I was able to force python to import the global one with
from __future__ import absolute_import
at the beginning of the file (this is the default in python 3.0)
You could reset your sys.path:
import sys

first = sys.path[0]
sys.path = sys.path[1:]
import test_utils
sys.path = [first] + sys.path
The first entry of sys.path is "always" (as in "per default"; see the Python docs) the current directory, so if you remove it you will do a global import.
Since my test_utils was in a Django project, I was able to use from ..test_utils import ... to import the global one.
Though, in the first place, I would always consider keeping the name of a local file from matching any global module name. But an easy workaround, without modifying sys.path, can be to import the global module in some other file and then import the global module from that file.
Remember, this file must be in some other folder than the folder containing the file whose name matches the global module.
For example.
./project/root/workarounds/global_imports.py
import test_utils as tutil
and then in
./project/root/mycode/test_utils.py
from project.root.workarounds.global_imports import tutil
# tutil is global test_utils
# you can also do
from project.root.workarounds.global_imports import test_utils
