How to accept both filenames and file-like objects in Python functions?

In my code, I have a load_dataset function that reads a text file and does some processing. Recently I thought about adding support for file-like objects, and I wondered about the best approach to this. Currently I have two implementations in mind:
First, type checking:
if isinstance(inputelement, basestring):
    # open file, processing etc
# or
# elif hasattr(inputelement, "read"):
elif isinstance(inputelement, file):
    # Do something else
Alternatively, two different arguments:
def load_dataset(filename=None, stream=None):
    if filename is not None and stream is None:
        # open file etc
    elif stream is not None and filename is None:
        # do something else
Neither solution really convinces me, however, especially the second, where I see way too many pitfalls.
What is the cleanest (and most Pythonic) way to accept a file-like object or string to a function that does text reading?

One way of accepting either a file name or a file-like object as an argument is to implement a context manager that can handle both. An implementation can be found here; I quote it for the sake of a self-contained answer:
class open_filename(object):
    """Context manager that opens a filename and closes it on exit, but does
    nothing for file-like objects.
    """
    def __init__(self, filename, *args, **kwargs):
        self.closing = kwargs.pop('closing', False)
        if isinstance(filename, basestring):
            self.fh = open(filename, *args, **kwargs)
            self.closing = True
        else:
            self.fh = filename

    def __enter__(self):
        return self.fh

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.closing:
            self.fh.close()
        return False
Possible usage then:
def load_dataset(file_):
    with open_filename(file_, "r") as f:
        # process here; the file is opened and closed only if file_ was a string

Don't accept both files and strings. If you're going to accept file-like objects, then it means you won't check the type, just call the required methods on the actual parameter (read, write, etc.). If you're going to accept strings, then you're going to end up opening files yourself, which means you won't be able to mock the parameters. So I'd say accept files, let the caller pass you a file-like object, and don't check the type.
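For example, a function written against the file-like interface never needs to touch the disk in tests; the caller can hand it an io.StringIO instead. A minimal sketch (load_dataset and the line format are placeholders, not from the original post):
import io

def load_dataset(fileobj):
    # Uses only the iteration interface; no type checks.
    return [line.strip() for line in fileobj]

# Production code opens the file and passes it in:
# with open('data.txt') as f:
#     dataset = load_dataset(f)

# Tests pass an in-memory file-like object instead:
dataset = load_dataset(io.StringIO('row1\nrow2\n'))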

I'm using a context manager wrapper. When it's a filename (str), close the file on exit.
from contextlib import contextmanager

@contextmanager
def fopen(filein, *args, **kwargs):
    if isinstance(filein, str):  # filename
        with open(filein, *args, **kwargs) as f:
            yield f
    else:  # file-like object
        yield filein
Then you can use it like:
with fopen(filename_or_fileobj) as f:
    # do something with f

Python follows duck typing: instead of checking an object's type, check for the capability you actually need. For example, test hasattr(obj, 'read') rather than isinstance(inputelement, file).
To convert a string into a file-like object, you can also use this construction:
from io import StringIO

if not hasattr(obj, 'read'):
    obj = StringIO(str(obj))
After this, obj can safely be used as a file-like object.
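Putting the duck-typing advice together, a minimal sketch of a reader that accepts both (the function name and return format are illustrative only):
from io import StringIO

def load_dataset(source):
    # Duck typing: anything with a read() method is treated as a file.
    if hasattr(source, 'read'):
        return source.read().splitlines()
    with open(source) as f:
        return f.read().splitlines()

print(load_dataset(StringIO('a\nb')))  # ['a', 'b']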


How to retrieve all the content of calls made to a mock?

I'm writing a unit test for a function that takes an array of dictionaries and ends up saving it in a CSV. I'm trying to mock it with pytest as usual:
csv_output = (
    "Name\tSurname\r\n"
    "Eve\tFirst\r\n"
)
with patch("builtins.open", mock_open()) as m:
    export_csv_func(array_of_dicts)
    assert m.assert_called_once_with('myfile.csv', 'wb') is None
    # ...and here I want to gather all output sent to the mock "m"
    # and assert it against csv_output
I can't find any simple way to get all the data sent to the mock during the open() phase by csv, so that I can do the comparison in bulk instead of line by line. To simplify things, I verified that the following code mimics the operations that export_csv_func() does to the mock:
with patch("builtins.open", mock_open()) as m:
    with open("myfile.csv", "wb") as f:
        f.write("Name\tSurname\r\n")
        f.write("Eve\tFirst\r\n")
When I dig into the mock, I see:
>>> m
<MagicMock name='open' spec='builtin_function_or_method' id='4380173840'>
>>> m.mock_calls
[call('myfile.csv', 'wb'),
 call().__enter__(),
 call().write('Name\tSurname\r\n'),
 call().write('Eve\tFirst\r\n'),
 call().__exit__(None, None, None)]
>>> m().write.mock_calls
[call('Name\tSurname\r\n'), call('Eve\tFirst\r\n')]
>>> dir(m().write.mock_calls[0])
['__add__'...(many methods), '_mock_from_kall', '_mock_name', '_mock_parent', 'call_list', 'count', 'index']
I don't see anything in the MagicMock interface where I can gather all the input that the mock has received.
I also tried calling m().write.call_args but it only returns the last call (the last element of the mock_calls attribute, i.e. call('Eve\tFirst\r\n')).
Is there any way of doing what I want?
You can create your own mock.call objects and compare them with what you have in the .call_args_list.
from unittest.mock import patch, mock_open, call

with patch("builtins.open", mock_open()) as m:
    with open("myfile.csv", "wb") as f:
        f.write("Name\tSurname\r\n")
        f.write("Eve\tFirst\r\n")

# Create your array of expected strings
expected_strings = ["Name\tSurname\r\n", "Eve\tFirst\r\n"]
write_calls = m().write.call_args_list

for expected_str in expected_strings:
    # assert that a mock.call(expected_str) exists in the write calls
    assert call(expected_str) in write_calls
Note that you can use the assert call of your choice. If you're in a unittest.TestCase subclass, prefer to use self.assertIn.
Additionally, if you just want the argument values, you can unpack a mock.call object as a tuple: index 0 holds the positional args, index 1 the keyword args. For example:
for write_call in write_calls:
    print('args: {}'.format(write_call[0]))
    print('kwargs: {}'.format(write_call[1]))
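To compare the output in bulk, as the question asks, you can also join every string passed to write into one value. A small sketch of mine (call objects expose .args on Python 3.8+; on older versions use write_call[0][0]):
# Concatenate the first positional argument of every write() call.
written = "".join(write_call.args[0] for write_call in m().write.call_args_list)
assert written == csv_output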
Indeed, you can't patch builtins.open.write directly: write isn't an attribute of open itself, but a method of the file object that open() returns.
There are a bunch of solutions and the one I would think of first would be to use your own mock. See the example:
class MockOpenWrite:
    def __init__(self, *args, **kwargs):
        self.res = []

    # This is what actually mocks the write. The name must match.
    def write(self, s: str):
        self.res.append(s)

    # These two methods are needed specifically for use in a with statement.
    # If you mock using a decorator, you don't need them.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return

mock = MockOpenWrite
with patch("builtins.open", mock):
    with open("myfile.csv", "w") as f:
        f.write("Name\tSurname\r\n")
        f.write("Eve\tFirst\r\n")
    print(f.res)
In that case, the res attribute is tied to the instance, so it disappears once the with block closes.
You could instead store the results somewhere else, such as a global or class-level list, and check them after the with block ends, as shown below.
Feel free to adapt this to your actual method.
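For example, collecting the writes in a class-level list keeps them readable after the with block (my variation, not part of the original answer):
from unittest.mock import patch

class MockOpenWrite:
    res = []  # class-level, so the results outlive any single instance

    def __init__(self, *args, **kwargs):
        pass

    def write(self, s):
        MockOpenWrite.res.append(s)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return

with patch("builtins.open", MockOpenWrite):
    with open("myfile.csv", "w") as f:
        f.write("Name\tSurname\r\n")

print(MockOpenWrite.res)  # still accessible here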
I had to do it this way (Python 3.9). It was quite tedious just to get the mock args out of the function.
from unittest.mock import MagicMock, patch
from somewhere import my_thing

@patch("lib.function", return_value=MagicMock())
def test_my_thing(my_mock):
    my_thing(value1, value2)
    (value1_call_args, value2_call_args) = my_mock.call_args_list[0].args

Python: How to decode enum type from json

import json
import yaml
from enum import IntEnum
from json import JSONEncoder

class MSG_TYPE(IntEnum):
    REQUEST = 0
    GRANT = 1
    RELEASE = 2
    FAIL = 3
    INQUIRE = 4
    YIELD = 5

    def __json__(self):
        return str(self)

class MessageEncoder(JSONEncoder):
    def default(self, obj):
        return obj.__json__()

class Message(object):
    def __init__(self, msg_type, src, dest, data):
        self.msg_type = msg_type
        self.src = src
        self.dest = dest
        self.data = data

    def __json__(self):
        return dict(
            msg_type=self.msg_type,
            src=self.src,
            dest=self.dest,
            data=self.data,
        )

    def ToJSON(self):
        return json.dumps(self, cls=MessageEncoder)

msg = Message(msg_type=MSG_TYPE.FAIL, src=0, dest=1, data="hello world")
encoded_msg = msg.ToJSON()
decoded_msg = yaml.load(encoded_msg)
print type(decoded_msg['msg_type'])
When calling print type(decoded_msg['msg_type']), I get the result <type 'str'> instead of the original MSG_TYPE type. I feel like I should also write a custom JSON decoder, but I'm confused about how to do that. Any ideas? Thanks.
When calling print type(decoded_msg['msg_type']), I get the result <type 'str'> instead of the original MSG_TYPE type.
Well, yeah, that's because you told MSG_TYPE to encode itself like this:
def __json__(self):
    return str(self)
So, that's obviously going to decode back to a string. If you don't want that, come up with some unique way to encode the values, instead of just encoding their string representations.
The most common way to do this is to encode all of your custom types (including your enum types) using some specialized form of object—just like you've done for Message. For example, you might put a py-type field in the object which encodes the type of your object, and then the meanings of the other fields all depend on the type. Ideally you'll want to abstract out the commonalities instead of hardcoding the same thing 100 times, of course.
I feel like I should also write a custom json decoder but kind of confused how to do that.
Well, have you read the documentation? Where exactly are you confused? You're not going to get a complete tutorial by tacking on a followup to a StackOverflow question…
Assuming you've got a special object structure for all your types, you can use an object_hook to decode the values back to the originals. For example, as a quick hack:
from json import JSONDecoder, JSONEncoder

class MessageEncoder(JSONEncoder):
    def default(self, obj):
        return {'py-type': type(obj).__name__, 'value': obj.__json__()}

class MessageDecoder(JSONDecoder):
    def __init__(self, hook=None, *args, **kwargs):
        if hook is None:
            hook = self.hook
        super().__init__(*args, object_hook=hook, **kwargs)

    def hook(self, obj):
        if isinstance(obj, dict):
            pytype = obj.get('py-type')
            if pytype:
                t = globals()[pytype]
                return t.__unjson__(**obj['value'])
        return obj
And now, in your Message class:
    @classmethod
    def __unjson__(cls, msg_type, src, dest, data):
        return cls(msg_type, src, dest, data)
And you need a MSG_TYPE.__json__ that returns a dict, maybe just {'name': str(self)}, then an __unjson__ that does something like getattr(cls, name).
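Spelling that suggestion out (my sketch, building on the classes above; the name field and the getattr lookup follow the answer's wording):
class MSG_TYPE(IntEnum):
    REQUEST = 0
    FAIL = 3  # other members elided for brevity

    def __json__(self):
        # self.name is 'FAIL' for MSG_TYPE.FAIL
        return {'name': self.name}

    @classmethod
    def __unjson__(cls, name):
        return getattr(cls, name)
Note the caveat the next answer raises: an IntEnum is already an int, so the encoder's default() may never be called for it unless you encode the member explicitly.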
A real-life solution should probably either have the classes register themselves instead of looking them up by name, or should handle looking them up by qualified name instead of just going to globals(). And you may want to let things encode to something other than object—or, if not, to just cram py-type into the object instead of wrapping it in another one. And there may be other ways to make the JSON more compact and/or readable. And a little bit of error handling would be nice. And so on.
You may want to look at the implementation of jsonpickle—not because you want to do the exact same thing it does, but to see how it hooks up all the pieces.
Overriding the default method of the encoder won't matter in this case because your object never gets passed to the method. It's treated as an int.
If you run the encoder on its own:
msg_type = MSG_TYPE.RELEASE
MessageEncoder().encode(msg_type)
You'll get:
'MSG_TYPE.RELEASE'
If you can, use an Enum and you shouldn't have any issues. I also asked a similar question:
How do I serialize IntEnum from enum34 to json in python?
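For completeness: an IntEnum member is itself an int, so json.dumps writes it as a plain number, and you can rebuild the member by calling the enum on the decoded value. A minimal sketch:
import json
from enum import IntEnum

class MSG_TYPE(IntEnum):
    FAIL = 3

encoded = json.dumps({'msg_type': MSG_TYPE.FAIL})  # '{"msg_type": 3}'
decoded = json.loads(encoded)
msg_type = MSG_TYPE(decoded['msg_type'])           # MSG_TYPE.FAIL again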

Use decorators to retrieve jsondata if file exists, otherwise run method and then store output as json?

I've read a little bit about decorators without my puny brain understanding them fully, but I believe this is one of the cases where they would be of use.
I have a main method running some other methods:
def run_pipeline():
    gene_sequence_fasta_files_made = create_gene_sequence_fasta_files()
    # ...several other methods, each taking one input argument and having one output argument.
Since each method takes a long time to run, I'd like to store the result in a json object for each method. If the json file exists I load it, otherwise I run the method and store the result. My current solution looks like this:
def run_pipeline():
    gene_sequence_fasta_files_made = _load_or_make(create_gene_sequence_fasta_files, "/home/myfolder/ff.json", method_input=None)
    ...
Problem is, I find this really ugly and hard to read. If it is possible, how would I use decorators to solve this problem?
PS: sorry for not showing my attempts. I haven't tried anything since I'm working against a deadline for a client and do not have the time (I could deliver the code above; I just find it aesthetically displeasing).
PPS: the definition of _load_or_make() is appended:
def _load_or_make(method, filename, method_input=None):
    try:
        with open(filename, 'r') as input_handle:
            data = json.load(input_handle)
    except IOError:
        if method_input is None:
            data = method()
        else:
            data = method(method_input)
        with open(filename, 'w+') as output_handle:
            json.dump(data, output_handle)
    return data
Here's a decorator that tries loading json from the given filename, and if it can't find the file or the json load fails, it runs the original function, writes the result as json to disk, and returns.
import json

def load_or_make(filename):
    def decorator(func):
        def wrapper(*args, **kwargs):
            try:
                with open(filename, 'r') as f:
                    return json.load(f)
            except Exception:
                data = func(*args, **kwargs)
                with open(filename, 'w') as out:
                    json.dump(data, out)
                return data
        return wrapper
    return decorator

@load_or_make(filename)
def your_method_with_arg(arg):
    # do stuff
    return data

@load_or_make(other_filename)
def your_method():
    # do stuff
    return data
Note that there is an issue with this approach: if the decorated method returns different values depending on the arguments passed to it, the cache won't behave properly. It looks like that isn't a requirement for you, but if it is, you'd need to pick a different filename depending on the arguments passed in (or use pickle-based serialization, and just pickle a dict of args -> results). Here's an example of how to do it using a pickle approach, (very) loosely based on the memoized decorator Christian P. linked to:
import pickle

def load_or_make(filename):
    def decorator(func):
        def wrapped(*args, **kwargs):
            # Make a key for the arguments. Try to make kwargs hashable
            # by using a tuple of its items() instead of the dict itself.
            key = (args, tuple(sorted(kwargs.items())))
            try:
                hash(key)
            except TypeError:
                # Don't try to use the cache if there's an
                # unhashable argument.
                return func(*args, **kwargs)
            try:
                with open(filename, 'rb') as f:
                    cache = pickle.load(f)
            except Exception:
                cache = {}
            if key in cache:
                return cache[key]
            value = func(*args, **kwargs)
            cache[key] = value
            with open(filename, 'wb') as f:
                pickle.dump(cache, f)
            return value
        return wrapped
    return decorator
Here, instead of saving the result as json, we pickle the result as a value in a dict, where the key is the arguments provided to the function. Note that you would still need to use a different filename for every function you decorate to ensure you never got incorrect results from the cache.
Do you want to save the results to disk, or is in-memory okay? If in-memory is fine, you can use the memoize decorator / pattern, found here: https://wiki.python.org/moin/PythonDecoratorLibrary#Memoize
For each set of unique input arguments, it saves the result from the function in memory. If the function is then called again with the same arguments, it returns the result from memory rather than trying to run the function again.
It can also be altered to allow for a timeout (depending on how long your program runs for) so that if called after a certain time, it should re-run and re-cache the results.
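A bare-bones, in-memory version of that pattern looks like this (my sketch, not the linked wiki implementation; it has no timeout and only handles hashable positional arguments):
from functools import wraps

def memoize(func):
    cache = {}

    @wraps(func)
    def wrapper(*args):
        # Compute once per distinct argument tuple, then reuse.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def expensive_step(x):
    return x * x  # stand-in for a long-running computation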
A decorator is simply a callable that takes a function (or a class) as an argument, does something with/to it, and returns something (usually the function in a wrapper, or the class modified or registered):
Since flat is better than nested, I like to use classes if the decorator is at all complex:
class GetData(object):
    def __init__(self, filename):
        # this is called on the @decorator line
        self.filename = filename

    def __call__(self, func):
        # this is called by Python with the completed def
        def wrapper(*args, **kwds):
            try:
                with open(self.filename) as stored:
                    data = json.load(stored)
            except IOError:
                data = func(*args, **kwds)
                with open(self.filename, 'w+') as stored:
                    json.dump(data, stored)
            return data
        return wrapper
and in use:
@GetData('/path/to/some/file')
def create_gene_sequence_fasta_files(this, that, these='those'):
    pass

@GetData('/path/to/some/other/file')
def create_gene_sequence_fastb_files():
    pass
I am no expert on Python decorators; I just learned them from a tutorial. But I think this can help you, even if it may not gain you much readability.
A decorator is a way to give several different functions the same shared behavior without making your code a mess or losing readability. It stays transparent to the rest of your code.
def _load_or_make(filename):
    def _deco(method):
        def __deco(method_input=None):
            try:
                with open(filename, 'r') as input_handle:
                    data = json.load(input_handle)
                    return data
            except IOError:
                if method_input is None:
                    data = method()
                else:
                    data = method(method_input)
                with open(filename, 'w+') as output_handle:
                    json.dump(data, output_handle)
                return data
        return __deco
    return _deco

@_load_or_make(filename)
def method(arg):
    # things that need to be done
    return data

Python class __init__ layout?

In python, is it bad form to write an __init__ definition like:
class someFileType(object):
    def __init__(self, path):
        self.path = path
        self.filename = self.getFilename()
        self.client = self.getClient()
        self.date = self.getDate()
        self.title = self.getTitle()
        self.filetype = self.getFiletype()

    def getFilename(self):
        '''Returns entire file name without extension'''
        filename = os.path.basename(self.path)
        filename = os.path.splitext(filename)
        filename = filename[0]
        return filename

    def getClient(self):
        '''Returns client name associated with file'''
        client = self.filename.split()
        client = client[1]  # Assuming filename is formatted "date client - docTitle"
        return client
where the initialized variables are calls to functions returning strings? Or is it considered lazy coding? It's mostly to save me from writing something.getFiletype() instead of something.filetype whenever I want to reference some aspect of the file.
This code is to sort files into folders by client, then by document type, and other manipulations based on data in the file name.
Nope, I don't see why that would be bad form. Calculating those values only once when the instance is created can be a great idea, in fact.
You could also postpone the calculations until needed by using caching properties:
class SomeFileType(object):
    _filename = None
    _client = None

    def __init__(self, path):
        self.path = path

    @property
    def filename(self):
        if self._filename is None:
            filename = os.path.basename(self.path)
            self._filename = os.path.splitext(filename)[0]
        return self._filename

    @property
    def client(self):
        '''Returns client name associated with file'''
        if self._client is None:
            client = self.filename.split()
            self._client = client[1]  # Assuming filename is formatted "date client - docTitle"
        return self._client
Now, accessing somefiletypeinstance.client will trigger calculation of self.filename as needed, as well as cache the result of its own calculation.
In this specific case, you may want to make .path a property as well; one with a setter that clears the cached values:
class SomeFileType(object):
    _filename = None
    _client = None

    def __init__(self, path):
        self._path = path

    @property
    def path(self):
        return self._path

    @path.setter
    def path(self, value):
        # clear all private instance attributes
        for key in [k for k in vars(self) if k[0] == '_']:
            delattr(self, key)
        self._path = value

    @property
    def filename(self):
        if self._filename is None:
            filename = os.path.basename(self.path)
            self._filename = os.path.splitext(filename)[0]
        return self._filename

    @property
    def client(self):
        '''Returns client name associated with file'''
        if self._client is None:
            client = self.filename.split()
            self._client = client[1]  # Assuming filename is formatted "date client - docTitle"
        return self._client
Because property-based caching does add some complexity overhead, you need to consider if it is really worth your while; for your specific, simple example, it probably is not. The calculation cost for your attributes is very low indeed, and unless you plan to create large quantities of these classes, the overhead of calculating the properties ahead of time is negligible, compared to the mental cost of having to maintain on-demand caching properties.
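As an aside (not part of the original answer): on Python 3.8+, functools.cached_property gives the same compute-once caching without the boilerplate:
import os
from functools import cached_property

class SomeFileType:
    def __init__(self, path):
        self.path = path

    @cached_property
    def filename(self):
        # Computed on first access, then stored on the instance.
        return os.path.splitext(os.path.basename(self.path))[0]
Note it has the same staleness caveat discussed below: changing path does not invalidate the cache (you would have to del the attribute).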
Your code is doing two different things:
a) Simplifying the class API by exposing certain computed attributes as variables, rather than functions.
b) Precomputing their values.
The first task is what properties are for; a straightforward use would make your code simpler, not more complex, and (equally important) would make the intent clearer:
class someFileType(object):
    @property
    def filename(self):
        return os.path.basename(self.path)
You can then write var.filename and you will dynamically compute the filename from the path.
@Martijn's solution adds caching, which also takes care of part b (precomputation). In your example, at least, the calculations are cheap so I don't see any benefit in doing so.
On the contrary, caching or precomputation raises consistency issues. Consider the following snippet:
something = someFileType("/home/me/document.txt")
print something.filename  # prints `document`
...
something.path = "/home/me/document-v2.txt"
print something.filename  # STILL prints `document` if you cache values
What should the last statement print? If you cache your computations, you will still get document instead of document-v2! Unless you are certain that nobody will try to change the value of the underlying variable, you need to either avoid caching or take measures to ensure consistency. The easiest way is to prohibit modifications to path, which is one of the things that properties are designed to do.
Conclusion: Use properties to simplify your interface. Don't cache computations, unless it's necessitated by performance reasons. If you cache, take measures to ensure consistency, e.g. by making the underlying value read-only.
PS. The issues are analogous to database normalization (non-normalized designs raise consistency issues), but in Python you have more resources for keeping things in sync.
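For reference, making path read-only is just a property with a getter and no setter; assignment then raises AttributeError (a minimal sketch):
import os

class SomeFileType(object):
    def __init__(self, path):
        self._path = path

    @property
    def path(self):
        # No @path.setter is defined, so path cannot be reassigned.
        return self._path

s = SomeFileType("/home/me/document.txt")
s.path = "/home/me/document-v2.txt"  # AttributeError: can't set attribute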

Reading objects from file after dumping them to file

I've made a function to write objects to a file:
def StoreToFile(Thefile, objekt):
    utfil = None
    utfil = open(Thefile, 'wb')
    pickle.dump(objekt, utfil)
    return True
    if utfil is not None:
        utfil.close()
And my code to use this function:
for st in Stadion:
    StoreToFile(r'C:\pytest\prod.psr', st)
This works like a charm, but how can I put the objects back into a list?
I have the code to extract the objects, but I'm unable to see how I can iterate through the objects to put them in a new list.
So far I have this:
def ReadFromFile(filename):
    infile = None
    infile = open(filename, 'rb')
    objekt = pickle.load(infile)
for st in Stadion:
    StoreToFile(r'C:\pytest\prod.psr', st)
This works like a charm.
If you mean "run without errors", then yes, it does "work". This code repeatedly overwrites the file, so it will only contain the last item in the list.
Use this instead:
StoreToFile(r'C:\pytest\prod.psr', Stadion)
Your ReadFromFile() function should work just fine as it is and return a list (assuming above fix).
Also, note that the code after return True in StoreToFile is unreachable, so the file is never explicitly closed:
    return True
    if utfil is not None:
        utfil.close()
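In code, pickling the whole list and reading it back looks like this (a sketch reusing the question's names):
import pickle

with open(r'C:\pytest\prod.psr', 'wb') as f:
    pickle.dump(Stadion, f)        # one dump stores the entire list

with open(r'C:\pytest\prod.psr', 'rb') as f:
    stadion_copy = pickle.load(f)  # one load returns the list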
Your code is silly: the utfil = None business doesn't make sense, because the only way open(...) can fail is with an exception, in which case the rest of the function won't be executed anyway. The right way to do this is with a context manager: the with statement.
Instead, do:
def storeToFile(path, o):
    try:
        with open(path, "wb") as f:
            pickle.dump(o, f)
        return True
    except (pickle.PicklingError, IOError):
        return False
You should just pickle the whole list.
To pickle the objects to the same file, use this function:
def storeToFile(fileName, o):
try:
with open(fileName, "a") as file:
cPickle.dump(o, file)
return True
except pickle.PicklingError, IOError:
return False
Note that the file is opened with mode "a", so that new data is appended to the end.
To load the objects again, use this:
def loadEntireFile(fileName):
    try:
        with open(fileName) as file:
            unpickler = cPickle.Unpickler(file)
            while True:
                yield unpickler.load()
    except EOFError:
        pass
This function tries to load objects from the file until it encounters EOF, which is indicated by an EOFError. You can use it like this:
foo = [str(x) for x in range(10)]
for x in foo:
    storeToFile("test.pickle", x)
foo2 = list(loadEntireFile("test.pickle"))
The list function takes any iterable and builds a list from it. The function loadEntireFile contains a yield statement, making it a generator, so it can be passed to any function taking an iterable.
