guaranteed unit test fixture for non-existent file - python

Imagine one writes a unit test for handle for a case when path does not exist:
def handle(path):
    try:
        with open(path) as f:
            pass
    except FileNotFoundError:
        raise FileNotFoundError(path)
I would write something like below for such test:
import pytest

def test_handle_on_non_existent_path():
    x = "abc"  # some unbelievable string
    with pytest.raises(FileNotFoundError):
        handle(x)
My question is: what is a better way to generate a non-existent path for a unit test?
My ideas are:
force-delete a temporary file
generate a random string, like a uuid?
"abc" is fairly concise, but in principle it does not guarantee the path does not exist.
Update: in this question x is "no_exist.txt"

With respect to unit testing, it seems your intent is to test the behaviour of your code for the case that open(path) throws a FileNotFoundError. Your approach is to have the code actually perform the open call, but with a non-existent path name. This has some disadvantages: as you already noticed, the dependency on the real file system raises the question of how to create a value for path that reliably does not exist on the file system. There is another point as well: you cannot be sure there is not some other problem with the file system, for example a permission-related one, which could cause a different exception (an OSError) to be raised.
Put together, performing the call to open itself means you are not in full control of what happens. Therefore, a better approach can be, for this unit test case, to mock open and make your mock raise the FileNotFoundError.
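For illustration, a minimal sketch of that approach (assuming handle is the function from the question and that it calls the builtin open): patch open so it raises FileNotFoundError, and the test never touches the real file system.
import pytest
from unittest import mock

def test_handle_on_non_existent_path():
    # Patch builtins.open so the call inside handle() raises FileNotFoundError,
    # regardless of what actually exists on disk.
    with mock.patch("builtins.open", side_effect=FileNotFoundError("abc")):
        with pytest.raises(FileNotFoundError):
            handle("abc")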

Related

How can I reproducibly (py)test the failure mode of code that opens a file if the file is missing?

I would like to write a pytest case for the behavior of a function that opens a file in case that file does not exist.
I think the question boils down to a different one, namely "How can I be sure a file path does not exist on the file system?"
import pytest

def file_content(file_name):
    with open(file_name, "r") as f:
        return f.read()

def test_file_content_file_not_found():
    file_name_of_inexistent_file = "???"
    with pytest.raises(FileNotFoundError):
        file_content(file_name_of_inexistent_file)

test_file_content_file_not_found()
"???" denotes the spot I think some great tool out there implements a reasonable and secure way to generate a file name or a mock file system that ensures the failure is guaranteed, but also doesn't need to change the file system.
At the moment, I have a small helper function that generates random strings, checks whether they correspond to existing files, and returns one that does not. This way I can emulate the desired behavior, but I guess there must be a more standard way to achieve this.
the easiest way is to use the builtin tmp_path fixture to generate a unique, empty directory:
def test_does_not_exist(tmp_path):
    with pytest.raises(FileNotFoundError):
        file_content(tmp_path.joinpath('dne'))
tmp_path is generated per-test and will start empty, so dne will never exist
if you want a generic, non-pytest solution you can use tempfile.TemporaryDirectory directly:
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    with pytest.raises(FileNotFoundError):
        file_content(os.path.join(tmpdir, 'dne'))
disclaimer: I'm a pytest core dev

How do I handle exceptions with Python generators using spaCy?

I am using the spacy language.pipe method to process texts as a stream, and yield Doc objects in order. (https://spacy.io/api/language#pipe).
This method is faster than processing files one by one and takes a generator object as input.
If the system hits a "bad file" I want to ensure that I can identify it. However, I am not sure how to achieve this with Python generators. What is the best approach for ensuring I capture the error? I don't currently have a file to cause an error but will likely find one in production.
I am using spaCy version 2.1 and Python 3.6.3
import os
import spacy

nlp = spacy.load('en')

def genenerator():
    path = "C:/Temp/tmp/"  # place any text files here for testing
    try:
        for root, _, files in os.walk(path, topdown=False):
            for name in files:
                with open(os.path.join(root, name), 'r', encoding='utf-8', errors='ignore') as inputFileStream:
                    docText = inputFileStream.read()
                yield (docText, name)
    except Exception as e:
        print('Error opening document. Doc name: {}'.format(os.path.join(root, name)), str(e))

def processfiles():
    try:
        for doc, file in nlp.pipe(genenerator(), as_tuples=True, batch_size=1000):
            print(file)
    except Exception as e:
        print('Error processing file: {}'.format(file), str(e))

if __name__ == '__main__':
    processfiles()
Edit - I have attempted to better explain my problem.
The specific thing I need to be able to do is identify exactly which file caused a problem for spaCy; in particular, I want to know exactly which file fails during this statement:
for doc, file in nlp.pipe(genenerator(), as_tuples = True, batch_size=1000):
My assumption is that it could be possible to run into a file that causes spaCy to have an issue during the pipe statement (for example during the tagger or parser processing pipeline).
Originally I was processing the text into spaCy file by file, so if spaCy had a problem then I knew exactly what file caused it. Using a generator this seems to be harder.
I am confident that errors that occur in the generator method itself can be captured, especially taking on board the comments by John Rutledge.
Perhaps a better way to ask the question is: how do I handle exceptions when generators are passed to methods like this?
My understanding is that the pipe method will process the generator as a stream.
It looks like your main problem is that your try/catch statement will currently halt execution on the first error it encounters. To continue yielding files when an error is encountered you need to place your try/catch further down in the for-loop, i.e. you can wrap the with open context manager.
Note also that a blanket try/catch is considered an anti-pattern in Python, so typically you will want to catch and handle the errors explicitly instead of using the general-purpose Exception. I included the more explicit IOError and OSError as examples.
Lastly, because you can catch the errors in the generator itself, the nlp.pipe call no longer needs the as_tuples param.
from pathlib import Path
import spacy

def grab_files(path):
    for path in Path(path).rglob('*'):
        if path.is_file():
            try:
                with open(str(path), 'r', encoding='utf-8', errors='ignore') as f:
                    yield f.read()
            except (OSError, IOError) as err:
                print(f'ERROR: {path}', err)

nlp = spacy.load('en')
for doc in nlp.pipe(grab_files('C:/Temp/tmp/'), batch_size=1000):
    print(doc)  # ... do something with spacy Doc here
Edit - to answer the follow-up question:
Note, you are still reading the contents of the text documents one at a time as you would have without a generator, however doing so via a generator returns an object that defers the execution until after you pass it into the nlp.pipe method. SpaCy then processes one batch of the text documents at a time via its internal util.minibatch function. That function ends in yield list(batch) which executes the code that opens/closes the files (1000 at a time in your case). So as regards any non-SpaCy related errors, i.e. errors associated with the opening/reading of the file, the code I posted should work as is.
However, as it stands, both your os.walk and my Path(path).rglob are indiscriminately picking up any file in the directory regardless of its filetype. So for example, if there were a .png file in your /tmp folder then SpaCy would raise a TypeError during the tokenization process. If you want to capture those kinds of errors then your best bet is to anticipate and avoid them before sending them to SpaCy, e.g., by amending your code with a whitelist that only allows certain file extensions (.rglob('*.txt')).
If you are working on a project that for some reason or another cannot afford to be interrupted by an error, no matter the cost, and you absolutely need to know at which stage of the pipeline the error occurred, then one approach might be to create a custom pipeline component for each default SpaCy pipeline component (Tagger, DependencyParser, etc.) you intend to use, and wrap those components in the blanket error handling/logging logic. Having done that, you could then process your files using your completely custom pipeline. But unless there is a gun pointed at your head, I would not recommend it; it is much better to anticipate the errors you expect to occur and handle them inside your generator. Perhaps someone with better knowledge of SpaCy's internals will have a better suggestion, though.
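For what it's worth, a rough sketch of that last idea under spaCy 2.x, where a pipeline component is just a callable that takes and returns a Doc. wrap_component is an illustrative name, not part of spaCy's API, and this is untested against any particular model:
import spacy

def wrap_component(component, name):
    """Wrap a pipeline component so failures are logged with the offending text."""
    def wrapped(doc):
        try:
            return component(doc)
        except Exception as err:
            # Report which component failed and a snippet of the document, then re-raise.
            print('ERROR in {} for doc starting with {!r}: {}'.format(name, doc.text[:50], err))
            raise
    return wrapped

nlp = spacy.load('en')
for name, component in list(nlp.pipeline):
    nlp.replace_pipe(name, wrap_component(component, name))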

Check the permissions of a file in python

I'm trying to check the readability of a file given the specified path. Here's what I have:
def read_permissions(filepath):
    '''Checks the read permissions of the specified file'''
    try:
        os.access(filepath, os.R_OK)  # Find the permissions using os.access
    except IOError:
        return False
    return True
This works and returns True or False as the output when run. However, I want the error messages from errno to accompany it. This is what I think I would have to do (But I know that there is something wrong):
def read_permissions(filepath):
    '''Checks the read permissions of the specified file'''
    try:
        os.access(filepath, os.R_OK)  # Find the permissions using os.access
    except IOError as e:
        print(os.strerror(e))  # Print the error message from errno as a string
    print("File exists.")
However, if I were to type in a file that does not exist, it tells me that the file exists. Can someone explain what I have done wrong (and how to avoid this issue in the future)? I haven't seen anyone try this using os.access. I'm also open to other options for testing the permissions of a file. How can I raise the appropriate error message when something goes wrong?
Also, this would likely apply to my other functions (They still use os.access when checking other things, such as the existence of a file using os.F_OK and the write permissions of a file using os.W_OK). Here is an example of the kind of thing that I am trying to simulate:
>>> read_permissions("located-in-restricted-directory.txt") # Because of a permission error (perhaps due to the directory)
[errno 13] Permission Denied
>>> read_permissions("does-not-exist.txt") # File does not exist
[errno 2] No such file or directory
This is the kind of thing that I am trying to simulate, by returning the appropriate error message to the issue. I hope that this will help avoid any confusion about my question.
I should probably point out that while I have read the os.access documentation, I am not trying to open the file later. I am simply trying to create a module in which some of the components check the permissions of a particular file. I have a baseline (the first piece of code I mentioned) which serves as a decision maker for the rest of my code. Here, I am simply trying to write it again in a more user-friendly way (not just True or False, but with complete messages). Since the IOError can be brought up a couple of different ways (such as permission denied, or a non-existent directory), I am trying to get my module to identify and report the issue. I hope that this helps you to help me determine any possible solutions.
os.access returns False when the file does not exist, regardless of the mode parameter passed.
This isn't stated explicitly in the documentation for os.access but it's not terribly shocking behavior; after all, if a file doesn't exist, you can't possibly access it. Checking the access(2) man page as suggested by the docs gives another clue, in that access returns -1 in a wide variety of conditions. In any case, you can simply do as I did and check the return value in IDLE:
>>> import os
>>> os.access('does_not_exist.txt', os.R_OK)
False
In Python it's generally discouraged to go around checking types and such before trying to actually do useful things. This philosophy is often expressed with the initialism EAFP, which stands for Easier to Ask Forgiveness than Permission. If you refer to the docs again, you'll see this is particularly relevant in the present case:
Note: Using access() to check if a user is authorized to e.g. open a file before actually doing so using open() creates a security hole, because the user might exploit the short time interval between checking and opening the file to manipulate it. It's preferable to use EAFP techniques. For example:
if os.access("myfile", os.R_OK):
    with open("myfile") as fp:
        return fp.read()
return "some default data"
is better written as:
try:
    fp = open("myfile")
except IOError as e:
    if e.errno == errno.EACCES:
        return "some default data"
    # Not a permission error.
    raise
else:
    with fp:
        return fp.read()
If you have other reasons for checking permissions than second-guessing the user before calling open(), you could look to How do I check whether a file exists using Python? for some suggestions. Remember that if you really need an exception to be raised, you can always raise it yourself; no need to go hunting for one in the wild.
Since the IOError can be brought up a couple different ways (such as permission denied, or non-existent directory), I am trying to get my module to identify and publish the issue.
That's what the second approach above does. See:
>>> try:
... open('file_no_existy.gif')
... except IOError as e:
... pass
...
>>> e.args
(2, 'No such file or directory')
>>> try:
... open('unreadable.txt')
... except IOError as e:
... pass
...
>>> e.args
(13, 'Permission denied')
>>> e.args == (e.errno, e.strerror)
True
But you need to pick one approach or the other. If you're asking forgiveness, do the thing (open the file) in a try-except block and deal with the consequences appropriately. If you succeed, then you know you succeeded because there's no exception.
On the other hand, if you ask permission (aka LBYL, Look Before You Leap) in this, that, and the other way, you still don't know if you can successfully open the file until you actually do it. What if the file gets moved after you check its permissions? What if there's a problem you didn't think to check for?
If you still want to ask permission, don't use try-except; you're not doing the thing so you're not going to throw errors. Instead, use conditional statements with calls to os.access as the condition.
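As a rough sketch of that last suggestion (the function name and messages are illustrative, modelled on the output the question asks for):
import os

def describe_read_access(filepath):
    '''Ask-permission style: report why a path is not readable, without try/except.'''
    if not os.path.exists(filepath):
        return "[errno 2] No such file or directory: %s" % filepath
    if not os.access(filepath, os.R_OK):
        return "[errno 13] Permission denied: %s" % filepath
    return "File is readable: %s" % filepath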

python unit testing: os.remove fails on file system

I am doing a bit of unit testing on a function which attempts to open a new file, but should fail if the file already exists. When the function runs successfully, the new file is created, so I want to delete it after every test run, but it doesn't seem to be working:
class MyObject_Initialisation(unittest.TestCase):

    def setUp(self):
        if os.path.exists(TEMPORARY_FILE_NAME):
            try:
                os.remove(TEMPORARY_FILE_NAME)
            except WindowsError:
                #TODO: can't figure out how to fix this...
                #time.sleep(3)
                #self.setUp() #this just loops forever
                pass

    def tearDown(self):
        self.setUp()
Any thoughts? The WindowsError thrown seems to suggest the file is in use... could it be that the tests are run in parallel threads?
I've read elsewhere that it's 'bad practice' to use the filesystem in unit testing, but really? Surely there's a way around this that doesn't involve dummying the filesystem?
If you're just looking for a temporary file, have a look at tempfile - this should handle the clean-up all on its own.
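For example, a minimal sketch (assuming Python 3, and that the function under test is handed the path it should create): a per-test temporary directory makes the cleanup automatic and guarantees no leftover file collides with the next run.
import os
import tempfile
import unittest

class MyObjectInitialisationWithTempDir(unittest.TestCase):

    def setUp(self):
        # A fresh, empty directory per test; cleanup() removes it and everything inside.
        self._tmpdir = tempfile.TemporaryDirectory()
        self.temporary_file_name = os.path.join(self._tmpdir.name, "temporary_file.txt")

    def tearDown(self):
        self._tmpdir.cleanup()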
Do you remember to explicitly close the file handle that operates on TEMPORARY_FILE_NAME?
From the Python documentation:
On Windows, attempting to remove a file that is in use causes an exception to be raised;

Python Error-Checking Standard Practice

I have a question regarding error checking in Python. Let's say I have a function that takes a file path as an input:
def myFunction(filepath):
    infile = open(filepath)
    #etc etc...
One possible precondition would be that the file should exist.
There are a few possible ways to check for this precondition, and I'm just wondering what's the best way to do it.
i) Check with an if-statement:
if not os.path.exists(filepath):
    raise IOError('File does not exist: %s' % filepath)
This is the way that I would usually do it, though the same IOError would be raised by Python if the file does not exist, even if I don't raise it.
ii) Use assert to check for the precondition:
assert os.path.exists(filepath), 'File does not exist: %s' % filepath
Using asserts seems to be the "standard" way of checking for pre/postconditions, so I am tempted to use these. However, asserts are turned off when the -O flag is used during execution, which means this check could be silently disabled, and that seems risky.
iii) Don't handle the precondition at all
This is because if filepath does not exist, there will be an exception generated anyway and the exception message is detailed enough for user to know that the file does not exist
I'm just wondering which of the above is the standard practice that I should use for my codes.
If all you want to do is raise an exception, use option iii:
def myFunction(filepath):
    with open(filepath) as infile:
        pass
To handle exceptions in a special way, use a try...except block:
def myFunction(filepath):
    try:
        with open(filepath) as infile:
            pass
    except IOError:
        # special handling code here
        pass
Under no circumstance is it preferable to check the existence of the file first (option i or ii) because in the time between when the check or assertion occurs and when python tries to open the file, it is possible that the file could be deleted, or altered (such as with a symlink), which can lead to bugs or a security hole.
Also, as of Python 2.6, the best practice when opening files is to use the with open(...) syntax. This guarantees that the file will be closed, even if an exception occurs inside the with-block.
In Python 2.5 you can use the with syntax if you preface your script with
from __future__ import with_statement
Definitely don't use an assert. Asserts should only fail if the code is wrong. External conditions (such as missing files) shouldn't be checked with asserts.
As others have pointed out, asserts can be turned off.
The formal semantics of assert are:
The condition may or may not be evaluated (so don't rely on side effects of the expression).
If the condition is true, execution continues.
It is undefined what happens if the condition is false.
More on this idea.
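A quick illustration of why that matters (check.py is a hypothetical file name): run normally the assert raises, but with python -O it is stripped out and execution falls through.
import os

# python check.py     -> AssertionError: File does not exist
# python -O check.py  -> prints "reached" (the assert is removed entirely)
assert os.path.exists("does-not-exist.txt"), "File does not exist"
print("reached")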
The following extends from ~unutbu's example. If the file doesn't exist, or on any other type of IO error, the filename is also passed along in the error message:
path = 'blam'
try:
    with open(path) as f:
        print(f.read())
except IOError as exc:
    raise IOError("%s: %s" % (path, exc.strerror))
=> IOError: blam: No such file or directory
I think you should go with a mix of iii) and i). If you know for a fact that Python will throw the exception (i.e. case iii), then let Python do it. If there are some other preconditions (e.g. demanded by your business logic), you should throw your own exceptions, maybe even derived from Exception.
Using asserts is too fragile imho, because they might be turned off.
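A small sketch of the "derive from Exception" idea (the exception name and the .csv check are purely illustrative):
class UnsupportedInputFile(Exception):
    '''Raised when a business-logic precondition on the input file fails.'''

def my_function(filepath):
    # A precondition Python will not check for us, so raise our own exception.
    if not filepath.endswith(".csv"):
        raise UnsupportedInputFile("expected a .csv file, got: %s" % filepath)
    # A missing file still raises its own IOError/FileNotFoundError here.
    with open(filepath) as infile:
        return infile.read()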
