How should a Python module be structured?

This is my first time posting on stack overflow, so I apologize if I do something wrong.
I am trying to understand the best way to structure a Python module. As an example, I made a backup module that syncs the source and destination, only copying files if there are differences between source and destination. The backup module contains only a class named Backup.
Now I was taught that OOP is the greatest thing ever, but this seems wrong. After looking through some of the standard library source, I see that almost everything isn't broken out into a class. I tried to do some research on this to determine when I should use a class and when I should just have functions, and I got varying info. I guess my main question is: should the following code be left as a class, or should it just be a module with functions? It is very simple currently, but I may want to add more in the future.
"""Class that represents a backup event."""
import hashlib
import os
import shutil
class Backup:
def __init__(self, source, destination):
self.source = source
self.destination = destination
def sync(self):
"""Synchronizes root of source and destination paths."""
sroot = os.path.normpath(self.source)
droot = os.path.normpath(self.destination) + '/' + os.path.basename(sroot)
if os.path.isdir(sroot) and os.path.isdir(droot):
Backup.sync_helper(sroot, droot)
elif os.path.isfile(sroot) and os.path.isfile(droot):
if not Backup.compare(sroot, droot):
Backup.copy(sroot, droot)
else:
Backup.copy(sroot, droot)
def sync_helper(source, destination):
"""Synchronizes source and destination."""
slist = os.listdir(source)
dlist = os.listdir(destination)
for s in slist:
scurr = source + '/' + s
dcurr = destination + '/' + s
if os.path.isdir(scurr) and os.path.isdir(dcurr):
Backup.sync_helper(scurr, dcurr)
elif os.path.isfile(scurr) and os.path.isfile(dcurr):
if not Backup.compare(scurr, dcurr):
Backup.copy(scurr, dcurr)
else:
Backup.copy(scurr, dcurr)
for d in dlist:
if d not in slist:
Backup.remove(destination + '/' + d)
def copy(source, destination):
"""Copies source file, directory, or symlink to destination"""
if os.path.isdir(source):
shutil.copytree(source, destination, symlinks=True)
else:
shutil.copy2(source, destination)
def remove(path):
"""Removes file, directory, or symlink located at path"""
if os.path.isdir(path):
shutil.rmtree(path)
else:
os.unlink(path)
def compare(source, destination):
"""Compares the SHA512 hash of source and destination."""
blocksize = 65536
shasher = hashlib.sha512()
dhasher = hashlib.sha512()
while open(source, 'rb') as sfile:
buf = sfile.read(blocksize)
while len(buf) > 0:
shasher.update(buf)
buf = sfile.read(blocksize)
while open(destination, 'rb') as dfile:
buf = dfile.read(blocksize)
while len(buf) > 0:
dhasher.update(buf)
buf = dfile.read(blocksize)
if shasher.digest() == dhasher.digest():
return True
else:
return False
I guess it doesn't really make sense as a class, since the only method is sync. On the other hand a backup is a real world object. This really confuses me.
As some side questions: my sync method and sync_helper function seem very similar, and it is probably possible to collapse the two somehow (I will leave that as an exercise for myself), but is this generally how this is done when using a recursive function that needs a certain initial state? Meaning, is it OK to do some setup in one function to reach a certain state and then call the recursive function that does the actual work? This seems messy.
Finally, I have a bunch of utility functions that aren't actually part of the object, but are used by sync. Would it make more sense to break these out into like a utility submodule or something as to not cause confusion?
Structuring my programs is the most confusing thing to me right now, any help would be greatly appreciated.

This method (and several others) is wrong:
def copy(source, destination):
    """Copies source file, directory, or symlink to destination"""
It works the way it was used (Backup.copy(scurr, dcurr)), but it does not work when used on an instance.
All methods in Python should take self as the first positional argument: def copy(self, source, destination), or should be turned into staticmethod (or moved out of the class).
A static method is declared using the staticmethod decorator:
@staticmethod
def copy(source, destination):
    """Copies source file, directory, or symlink to destination"""
But in this case, source and destination are actually attributes of Backup instances, so probably it should be modified to use attributes:
def copy(self):
    # use self.source and self.destination
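For illustration, here is a minimal sketch of the staticmethod variant (my own sketch, not from the answer; it keeps copy's explicit parameters so the recursive calls inside sync_helper keep working; remove and compare would be decorated the same way):

import os
import shutil

class Backup:
    def __init__(self, source, destination):
        self.source = source
        self.destination = destination

    @staticmethod
    def copy(source, destination):
        """Copies source file, directory, or symlink to destination."""
        if os.path.isdir(source):
            shutil.copytree(source, destination, symlinks=True)
        else:
            shutil.copy2(source, destination)

With this change, both Backup.copy(src, dst) and instance.copy(src, dst) work, because the decorator tells Python not to pass the instance as the first argument.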


Is it possible to mock os.scandir and its attributes?

for entry in os.scandir(document_dir):
    if os.path.isdir(entry):
        pass  # some code goes here
    else:
        # else the file needs to be in a folder
        file_path = entry.path.replace(os.sep, '/')
I am having trouble mocking os.scandir and the path attribute within the else statement. I am not able to mock the mock object's property I created in my unit tests.
with patch("os.scandir") as mock_scandir:
    # mock_scandir.return_value = ["docs.json", ]
    # mock_scandir.side_effect = ["docs.json", ]
    # mock_scandir.return_value.path = PropertyMock(return_value="docs.json")
These are all the options I've tried. Any help is greatly appreciated.
It depends on what you really need to mock. The problem is that os.scandir returns entries of type os.DirEntry. One possibility is to use your own mock DirEntry and implement only the methods that you need (in your example, only path). For your example, you also have to mock os.path.isdir. Here is a self-contained example of how you can do this:
import os
from unittest.mock import patch


def get_paths(document_dir):
    # example function containing your code
    paths = []
    for entry in os.scandir(document_dir):
        if os.path.isdir(entry):
            pass
        else:
            # else the file needs to be in a folder
            file_path = entry.path.replace(os.sep, '/')
            paths.append(file_path)
    return paths


class DirEntry:
    # minimal stand-in for os.DirEntry; only the path attribute is needed
    def __init__(self, path):
        self.path = path


@patch("os.scandir")
@patch("os.path.isdir")
def test_sut(mock_isdir, mock_scandir):
    mock_isdir.return_value = False
    mock_scandir.return_value = [DirEntry("docs.json")]
    assert get_paths("anydir") == ["docs.json"]
Depending on your actual code, you may have to do more.
If you want to patch more file system functions, you may consider using pyfakefs instead, which patches the whole file system. This would be overkill for a single test, but can be handy for a test suite relying on file system functions.
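As a rough sketch of what that could look like (assuming pyfakefs is installed and its pytest plugin is active, which provides the fs fixture; the paths here are made up):

def test_get_paths_with_pyfakefs(fs):
    # fs is pyfakefs's fake in-memory file system
    fs.create_file("/docs/docs.json")
    fs.create_dir("/docs/subdir")  # directories are skipped by get_paths
    assert get_paths("/docs") == ["/docs/docs.json"]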
Disclaimer: I'm a contributor to pyfakefs.

Sharing Variables across new object instances

Background:
I am writing a module in order to set up an embedded system. In this context I need to load some modules and perform some system settings.
Context:
I have a parent class holding some general code (load the config file, build the ssh connection, etc.) used by several child classes. One of them is the module class that sets up the module and therefore uses, among other things, the ssh connection and the configuration file.
My goal is to share the configuration file and the connection with the next module that will be set up. For the connection it's just a waste to build and destroy it all the time, and for the configuration file, changes during setup can lead to undefined behavior.
Research/ approaches:
I tried using class variables, however they aren't passed when initiating a new module object.
Further, I tried using global variables, but since the parent class and the child classes are in different files, this won't work (yes, I can put them all in one file, but that would be a mess). Also, using a getter function from the file where I defined the global variable didn't work.
I am aware of the 'builtin' solution from How to make a cross-module variable?, but feel this would be a bit of an overkill...
Finally, I could keep the config file and the connection in a central script and pass them to each of the instances, but this would lead to loads of dependencies and I don't think it's a good solution.
So here is a bit of code with an example method to get some file paths. The code is set up according to approach 1 (class variables).
An example config file:
Files:
  Core:
    Backend:
      - 'file_1'
      - 'file_2'
Local:
  File_path:
    Backend: '/the/path/to'
The Parent class in setup_class.py
import os
import abc
import yaml


class setup(object):
    __metaclass__ = abc.ABCMeta

    configuration = []

    def check_for_configuration(self, config_file):
        if not self.configuration:
            with open(config_file, "r") as config:
                self.configuration = yaml.safe_load(config)

    def get_configuration(self):
        return self.configuration

    def _make_file_list(self, path, names):
        full_file_path = []
        for file_name in names:
            temp_path = path + '/' + file_name
            temp_path = temp_path.split('/')
            full_file_path.append(os.path.join(*temp_path))
        return full_file_path

    @abc.abstractmethod
    def install(self):
        raise NotImplementedError
The module class in module_class.py
from setup_class import setup


class module(setup):
    def __init__(self, module_name, config_file=''):
        self.name = module_name
        self.check_for_configuration(config_file)

    def install(self):
        self._backend()

    def _backend(self):
        files = self._make_file_list(
            self.configuration['Local']['File_path']['Backend'],
            self.configuration['Files'][self.name]['Backend'])
        if files:
            print files
And finally a test script:
from module_class import module
Analysis = module('Analysis',"./example.yml")
Core = module('Core', "./example.yml")
Core.install()
Now, when running the code, the config file is loaded every time a new module object is initiated. I would like to avoid this. Are there approaches I have not considered? What's the neatest way to achieve this?
Save your global values in a global dict, and refer to that inside your module.
cache = {}


class Cache(object):
    def __init__(self):
        global cache
        self.cache = cache
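A note on why approach 1 (class variables) reloaded the file anyway: assigning self.configuration = ... inside check_for_configuration rebinds the name on the instance, so every new object starts from the empty class-level list again. A minimal sketch of the fix, assuming the goal is to load the YAML once and share it across all instances:

import yaml


class setup(object):
    configuration = None  # shared across all instances and subclasses

    def check_for_configuration(self, config_file):
        if setup.configuration is None:
            with open(config_file, "r") as config:
                # assign on the class, not on self, so the value is shared
                setup.configuration = yaml.safe_load(config)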

How to save the state of a generator function in python? [duplicate]

Consider this scenario:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
walk = os.walk('/home')
for root, dirs, files in walk:
for pathname in dirs+files:
print os.path.join(root, pathname)
for root, dirs, files in walk:
for pathname in dirs+files:
print os.path.join(root, pathname)
I know that this example is kinda redundant, but you should consider that we need to use the same walk data more than once. I have a benchmark scenario, and using the same walk data is mandatory to get helpful results.
I've tried walk2 = walk to clone and use in the second iteration, but it didn't work. The question is... how can I copy it? Is it even possible?
Thank you in advance.
You can use itertools.tee():
walk, walk2 = itertools.tee(walk)
Note that this might "need significant extra storage", as the documentation points out.
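A minimal sketch of what that looks like in practice (the path is just an example):

import itertools
import os

walk = os.walk('/home')
walk, walk2 = itertools.tee(walk)

first = list(walk)    # consuming one copy...
second = list(walk2)  # ...does not affect the other
assert first == second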
If you know you are going to iterate through the whole generator for every usage, you will probably get the best performance by unrolling the generator to a list and using the list multiple times.
walk = list(os.walk('/home'))
Define a function
def walk_home():
for r in os.walk('/home'):
yield r
Or even this
def walk_home():
return os.walk('/home')
Both are used like this:
for root, dirs, files in walk_home():
for pathname in dirs+files:
print os.path.join(root, pathname)
This is a good use case for functools.partial() to make a quick generator factory:
from functools import partial
import os
walk_factory = partial(os.walk, '/home')
walk1, walk2, walk3 = walk_factory(), walk_factory(), walk_factory()
What functools.partial() does is hard to describe in words, but this is what it's for: it partially fills in a function's parameters without executing that function. Consequently it acts as a function/generator factory.
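In other words, calling the partial object is equivalent to calling the original function with the pre-filled arguments, and each call produces a fresh generator. A minimal sketch:

from functools import partial
import os

walk_factory = partial(os.walk, '/home')
gen1 = walk_factory()  # equivalent to os.walk('/home'): a brand-new generator
gen2 = walk_factory()  # another independent generator over the same tree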
This answer aims to extend/elaborate on what the other answers have expressed. The solution will necessarily vary depending on what exactly you aim to achieve.
If you want to iterate over the exact same result of os.walk multiple times, you will need to initialize a list from the os.walk iterable's items (i.e. walk = list(os.walk(path))).
If you must guarantee the data remains the same, that is probably your only option. However, there are several scenarios in which this is not possible or desirable.
It will not be possible to list() an iterable if the output is of sufficient size (e.g. attempting to list() an entire filesystem may freeze your computer).
It is not desirable to list() an iterable if you wish to acquire "fresh" data prior to each use.
In the event that list() is not suitable, you will need to run your generator on demand. Note that generators are exhausted after each use, so this poses a slight problem. In order to "rerun" your generator multiple times, you can use the following pattern:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os


class WalkMaker:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        for root, dirs, files in os.walk(self.path):
            for pathname in dirs + files:
                yield os.path.join(root, pathname)


walk = WalkMaker('/home')

for path in walk:
    pass

# do something...

for path in walk:
    pass
The aforementioned design pattern will allow you to keep your code DRY.
This "Python Generator Listeners" code allows you to have many listeners on a single generator, like os.walk, and even have someone "chime in" later.
def walkme():
    return os.walk('/home')

m1 = Muxer(walkme)
m2 = Muxer(walkme)

m1 and m2 can then even run in separate threads, each processing items at its leisure.
See: https://gist.github.com/earonesty/cafa4626a2def6766acf5098331157b3
import queue
from threading import Lock
from collections import namedtuple


class Muxer():
    Entry = namedtuple('Entry', 'genref listeners lock')
    already = {}
    top_lock = Lock()

    def __init__(self, func, restart=False):
        self.restart = restart
        self.func = func
        self.queue = queue.Queue()
        with self.top_lock:
            if func not in self.already:
                self.already[func] = self.Entry([func()], [], Lock())
            ent = self.already[func]
            self.genref = ent.genref
            self.lock = ent.lock
            self.listeners = ent.listeners
            self.listeners.append(self)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            e = self.queue.get_nowait()
        except queue.Empty:
            with self.lock:
                try:
                    e = self.queue.get_nowait()
                except queue.Empty:
                    try:
                        e = next(self.genref[0])
                        for other in self.listeners:
                            if other is not self:
                                other.queue.put(e)
                    except StopIteration:
                        if self.restart:
                            self.genref[0] = self.func()
                        raise
        return e

    def __del__(self):
        with self.top_lock:
            try:
                self.listeners.remove(self)
            except ValueError:
                pass
            if not self.listeners and self.func in self.already:
                del self.already[self.func]
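A minimal single-threaded sketch of the behavior (the gen function here is a made-up stand-in for something like walkme):

def gen():
    for i in range(3):
        yield i

m1 = Muxer(gen)
m2 = Muxer(gen)
print(list(m1))  # [0, 1, 2]; each item is also queued for m2 as m1 consumes it
print(list(m2))  # [0, 1, 2]; drained from m2's queue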

Check Contents of Python Package without Running it?

I would like a function that, given a name which caused a NameError, can identify Python packages which could be imported to resolve it.
That part is fairly easy, and I've done it, but now I have an additional problem: I'd like to do it without causing side-effects. Here's the code I'm using right now:
def necessaryImportFor(name):
    from pkgutil import walk_packages
    for package in walk_packages():
        if package[1] == name:
            return name
        try:
            if hasattr(__import__(package[1]), name):
                return package[1]
        except Exception as e:
            print("Can't check " + package[1] + " on account of a " + e.__class__.__name__ + ": " + str(e))
    print("No possible import satisfies " + name)
The problem is that this code actually __import__s every module. This means that every side-effect of importing every module occurs. When testing my code I found that side-effects that can be caused by importing all modules include:
Launching tkinter applications
Requesting passwords with getpass
Requesting other input or raw_input
Printing messages (import this)
Opening websites (import antigravity)
A possible solution that I considered would be finding the path to every module (how? It seems to me that the only way to do this is by importing the module then using some methods from inspect on it), then parsing it to find every class, def, and = that isn't itself within a class or def, but that seems like a huge PITA and I don't think it would work for modules which are implemented in C/C++ instead of pure Python.
Another possibility is launching a child Python instance which has its output redirected to devnull and performing its checks there, killing it if it takes too long. That would solve the first four bullets, and the fifth one is such a special case that I could just skip antigravity. But having to start up thousands of instances of Python in this single function seems a bit... heavy and inefficient.
Does anyone have a better solution I haven't considered? Is there a simple way of just telling Python to generate an AST or something without actually importing a module, for example?
So I ended up writing a few methods which can list everything from a source file, without importing the source file.
The ast module doesn't seem particularly well documented, so this was a bit of a PITA trying to figure out how to extract everything of interest. Still, after ~6 hours of trial and error today, I was able to get this together and run it on the 3000+ Python source files on my computer without any exceptions being raised.
def listImportablesFromAST(ast_):
    from ast import (Assign, ClassDef, FunctionDef, Import, ImportFrom, Name,
                     For, Tuple, TryExcept, TryFinally, With)

    if isinstance(ast_, (ClassDef, FunctionDef)):
        return [ast_.name]
    elif isinstance(ast_, (Import, ImportFrom)):
        return [name.asname if name.asname else name.name for name in ast_.names]

    ret = []

    if isinstance(ast_, Assign):
        for target in ast_.targets:
            if isinstance(target, Tuple):
                ret.extend([elt.id for elt in target.elts])
            elif isinstance(target, Name):
                ret.append(target.id)
        return ret

    # These two attributes cover everything of interest from If, Module,
    # and While. They also cover parts of For, TryExcept, TryFinally, and With.
    if hasattr(ast_, 'body') and isinstance(ast_.body, list):
        for innerAST in ast_.body:
            ret.extend(listImportablesFromAST(innerAST))
    if hasattr(ast_, 'orelse'):
        for innerAST in ast_.orelse:
            ret.extend(listImportablesFromAST(innerAST))

    if isinstance(ast_, For):
        target = ast_.target
        if isinstance(target, Tuple):
            ret.extend([elt.id for elt in target.elts])
        else:
            ret.append(target.id)
    elif isinstance(ast_, TryExcept):
        for innerAST in ast_.handlers:
            ret.extend(listImportablesFromAST(innerAST))
    elif isinstance(ast_, TryFinally):
        for innerAST in ast_.finalbody:
            ret.extend(listImportablesFromAST(innerAST))
    elif isinstance(ast_, With):
        if ast_.optional_vars:
            ret.append(ast_.optional_vars.id)

    return ret


def listImportablesFromSource(source, filename='<Unknown>'):
    from ast import parse
    return listImportablesFromAST(parse(source, filename))


def listImportablesFromSourceFile(filename):
    with open(filename) as f:
        source = f.read()
    return listImportablesFromSource(source, filename)
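A quick usage sketch (the source string is just an example; note this answer targets Python 2's ast node names, e.g. TryExcept):

print(listImportablesFromSource("import os\nX = 1\ndef f(): pass"))
# -> ['os', 'X', 'f']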
The above code covers the titular question: How do I check the contents of a Python package without running it?
But it leaves you with another question: How do I get the path to a Python package from just its name?
Here's what I wrote to handle that:
class PathToSourceFileException(Exception):
    pass


class PackageMissingChildException(PathToSourceFileException):
    pass


class PackageMissingInitException(PathToSourceFileException):
    pass


class NotASourceFileException(PathToSourceFileException):
    pass


def pathToSourceFile(name):
    '''
    Given a name, returns the path to the source file, if possible.
    Otherwise raises an ImportError or subclass of PathToSourceFileException.
    '''
    from os.path import dirname, isdir, isfile, join

    if '.' in name:
        parentSource = pathToSourceFile('.'.join(name.split('.')[:-1]))
        path = join(dirname(parentSource), name.split('.')[-1])
        if isdir(path):
            path = join(path, '__init__.py')
            if isfile(path):
                return path
            raise PackageMissingInitException()
        path += '.py'
        if isfile(path):
            return path
        raise PackageMissingChildException()

    from imp import find_module, PKG_DIRECTORY, PY_SOURCE
    f, path, (suffix, mode, type_) = find_module(name)
    if f:
        f.close()
    if type_ == PY_SOURCE:
        return path
    elif type_ == PKG_DIRECTORY:
        path = join(path, '__init__.py')
        if isfile(path):
            return path
        raise PackageMissingInitException()
    raise NotASourceFileException('Name ' + name + ' refers to the file at path ' + path + ' which is not that of a source file.')
Tying the two bits of code together, I have this function:
def listImportablesFromName(name, allowImport=False):
    try:
        return listImportablesFromSourceFile(pathToSourceFile(name))
    except PathToSourceFileException:
        if not allowImport:
            raise
        return dir(__import__(name))
Finally, here's the implementation for the function that I mentioned I wanted in my question:
def necessaryImportFor(name):
    packageNames = []
    def nameHandler(name):
        packageNames.append(name)
    from pkgutil import walk_packages
    for package in walk_packages(onerror=nameHandler):
        nameHandler(package[1])
    # Suggestion: Sort package names by count of '.', so shallower packages are searched first.
    for package in packageNames:
        # Suggestion: just skip any package that starts with 'test.'
        try:
            if name in listImportablesFromName(package):
                return package
        except ImportError:
            pass
        except PathToSourceFileException:
            pass
    return None
And that's how I spent my Sunday.

What is NamedTemporaryFile useful for on Windows?

The Python module tempfile contains both NamedTemporaryFile and TemporaryFile. The documentation for the former says
Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows NT or later)
What is the point of the file having a name if I can't use that name? If I want the useful (for me) behaviour of Unix on Windows, I've got to make a copy of the code and rip out all the bits that say if _os.name == 'nt' and the like.
What gives? Surely this is useful for something, since it was deliberately coded this way, but what is that something?
The documentation is saying that you cannot open the file a second time while it is still open. You can still use the name otherwise; just be sure to pass delete=False when creating the NamedTemporaryFile so that it persists after it is closed.
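A minimal sketch of that pattern, which also works on Windows (the file name is generated by tempfile):

import os
import tempfile

with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as tf:
    tf.write('hello')
    name = tf.name

with open(name) as f:  # reopening by name works once the handle is closed
    assert f.read() == 'hello'
os.unlink(name)        # with delete=False, cleanup is the caller's job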
I use this:
import os, tempfile, gc


class TemporaryFile:
    def __init__(self, name, io, delete):
        self.name = name
        self.__io = io
        self.__delete = delete

    def __getattr__(self, k):
        return getattr(self.__io, k)

    def __del__(self):
        if self.__delete:
            try:
                os.unlink(self.name)
            except FileNotFoundError:
                pass


def NamedTemporaryFile(mode='w+b', bufsize=-1, suffix='', prefix='tmp', dir=None, delete=True):
    if not dir:
        dir = tempfile.gettempdir()
    name = os.path.join(dir, prefix + os.urandom(32).hex() + suffix)
    if mode is None:
        return TemporaryFile(name, None, delete)
    fh = open(name, "w+b", bufsize)
    if mode != "w+b":
        fh.close()
        fh = open(name, mode)
    return TemporaryFile(name, fh, delete)


def test_ntf_txt():
    x = NamedTemporaryFile("w")
    x.write("hello")
    x.close()
    assert os.path.exists(x.name)
    with open(x.name) as f:
        assert f.read() == "hello"


def test_ntf_name():
    x = NamedTemporaryFile(suffix="s", prefix="p")
    assert os.path.basename(x.name)[0] == 'p'
    assert os.path.basename(x.name)[-1] == 's'
    x.write(b"hello")
    x.seek(0)
    assert x.read() == b"hello"


def test_ntf_del():
    x = NamedTemporaryFile(suffix="s", prefix="p")
    assert os.path.exists(x.name)
    name = x.name
    del x
    gc.collect()
    assert not os.path.exists(name)


def test_ntf_mode_none():
    x = NamedTemporaryFile(suffix="s", prefix="p", mode=None)
    assert not os.path.exists(x.name)
    name = x.name
    f = open(name, "w")
    f.close()
    assert os.path.exists(name)
    del x
    gc.collect()
    assert not os.path.exists(name)
This works on all platforms; you can close it, open it up again, etc.
The mode=None feature is what you want: you can ask for a tempfile, specify mode=None, and get a UUID-styled temp name with the dir/suffix/prefix that you want. The tests above show usage.
It's basically the same as NamedTemporaryFile, except the file will get auto-deleted when the object returned is garbage collected... not on close.
You don't want to "rip out all the bits...". It's coded like that for a reason. It says you can't open it a SECOND time while it's still open. Don't. Just use it once, and throw it away (after all, it is a temporary file). If you want a permanent file, create your own.
"Surely this is useful for something, since it was deliberately coded this way, but what is that something". Well, I've used it to write emails to (in a binary format) before copying them to a location where our Exchange Server picks them up & sends them. I'm sure there are lots of other use cases.
I'm pretty sure the Python library writers didn't just decide to make NamedTemporaryFile behave differently on Windows for laughs. All those _os.name == 'nt' tests will be there because of platform differences between Windows and Unix. So my inference from that documentation is that on Windows a file opened the way NamedTemporaryFile opens it cannot be opened again while NamedTemporaryFile still has it open, and that this is due to the way Windows works.
