For yaml load use open vs with - python

I have a program that reads yaml configs, just read only. I am wondering which one of the following is more Pythonic
try:
    config = yaml.load(open(filepath))
except Exception as error:
    print(error)
vs using a with statement
try:
    with open(filepath) as f:
        config = yaml.load(f)
except Exception as error:
    print(error)
I prefer the first one because it's simpler to read, and since there are no writes I don't think there will be issues with the file closing gracefully. Thoughts?

Use the second. From the documentation:
It is good practice to use the with keyword when dealing with file
objects. This has the advantage that the file is properly closed after
its suite finishes, even if an exception is raised on the way.
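A small sketch of what the quoted passage means in practice. It uses the standard-library json module in place of yaml so that it runs without extra packages; that substitution, the temporary file, and the name load_config are assumptions made for illustration. The point is only that the file object is closed once the with block exits, even when parsing fails.

```python
import json
import os
import tempfile


def load_config(filepath):
    """Parse a config file; also return the file object so the caller can inspect it."""
    handle = None
    try:
        with open(filepath) as f:
            handle = f
            config = json.load(f)  # stand-in for yaml.load(f)
    except ValueError as error:
        # json.JSONDecodeError is a subclass of ValueError
        print("could not parse config:", error)
        config = None
    return config, handle


# Write a deliberately malformed config so that parsing raises.
fd, bad_path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as tmp:
    tmp.write("{not valid json")

config, handle = load_config(bad_path)
file_was_closed = handle.closed  # the with block closed the file despite the error
os.remove(bad_path)
```

With the one-line form yaml.load(open(filepath)) there is no name left to call close() on, so the file stays open until the interpreter gets around to reclaiming the object.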

Is it still unsafe to process files without using with in python 3? [duplicate]

For the sake of a small example, let's say that I want to read each line from this file into a list:
First line
Second line
Third line
Fourth line
The file is called example.txt and is in my current working directory.
I've always been told that I should do it using a with statement like these:
# method 1
lines = []
with open("example.txt") as handle:
    for line in handle:
        lines.append(line.strip("\n"))
# method 2
with open("example.txt") as handle:
    lines = [line.strip("\n") for line in handle]
There's a lot of information out there on this; I found some good material in "What is the python keyword "with" used for?". It seems this is (or was) recommended so that the file is properly closed, especially when an exception is thrown during processing.
# method 3
handle = open("example.txt")
lines = [line.strip("\n") for line in handle]
handle.close()
# method 4
lines = [line.strip("\n") for line in open("example.txt")]
For that reason the above two methods would be frowned upon. But all of the information I could find on the subject is extremely old, most of it aimed at Python 2.x. Sometimes using with instead of these methods within more complex code is just impractical.
I know that at some point in Python 3, variables within a list comprehension were limited to their own scope (so that the variables would not leak out of the comprehension). Could that take care of the issue in method 4? Would I still experience issues when using method 3 if an exception was thrown before I explicitly called .close()?
Is the insistence on always using with to open file objects just old habits dying hard?
The with statement works with context managers, and the file object returned by open() is one. You are not required to use it to work with a file, but it is good practice.
It is good practice to use the with keyword when dealing with file
objects. The advantage is that the file is properly closed after its
suite finishes, even if an exception is raised at some point. Using
with is also much shorter than writing equivalent try-finally blocks...
Python Tutorial Section 7.2
Using it is syntactic sugar for a try/finally block similar to:
f = open('example.txt', 'w')
try:
    f.write('hello, world')
finally:
    f.close()
You could explicitly write that block out whenever you want to work with a file. In my opinion it is more pythonic, and easier to read, to use the context manager syntax.
with open('example.txt', 'w') as f:
    f.write('hello, world')
If you would like to learn a little more about context managers check out this blog post and of course the Python Documentation on contextlib.
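As a rough illustration of what contextlib offers, here is a minimal sketch of a hand-written context manager. The name tracked and the events list are made up for this example; the only library feature relied on is contextlib.contextmanager, which turns a generator with a single yield into something usable in a with statement.

```python
from contextlib import contextmanager


@contextmanager
def tracked(name, log):
    """Record entry and exit of a block in `log`, even if the block raises."""
    log.append("enter " + name)
    try:
        yield name
    finally:
        # Runs on normal exit and when the with block raises.
        log.append("exit " + name)


events = []
try:
    with tracked("demo", events) as label:
        events.append("working in " + label)
        raise RuntimeError("something went wrong")
except RuntimeError:
    events.append("handled error")
```

The exit step is recorded before the exception reaches the outer except clause, which is the same ordering a file object gives you: close first, then propagate.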
There is also a 5th way:
handle = None
try:
    handle = open("example.txt")
except:
    pass
finally:
    if handle is not None:
        handle.close()
Note that the following would not work: if open() raised an exception, handle would never be bound, so handle.close() in the finally block would itself raise a NameError and crash the program:
try:
    handle = open("example.txt")
except:
    pass
finally:
    handle.close()
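The failure mode of that last snippet can be reproduced with a short sketch. The function name broken_read and the missing path are invented for the demonstration; inside a function the error surfaces as UnboundLocalError, which is a subclass of NameError.

```python
import os
import tempfile


def broken_read(path):
    try:
        handle = open(path)
    except OSError:
        pass  # the open() failure is swallowed here...
    finally:
        handle.close()  # ...but handle was never bound, so this line raises


# A path inside a directory that does not exist, so open() is sure to fail.
missing = os.path.join(tempfile.gettempdir(), "no-such-dir-for-demo", "example.txt")

try:
    broken_read(missing)
    outcome = "no error"
except NameError as error:  # UnboundLocalError is a subclass of NameError
    outcome = type(error).__name__
```

Initialising handle = None before the try block, as in the 5th way above, or simply using a with statement avoids the problem.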
Yes, closing files is still needed. The concept is present in virtually all programming languages: an open file holds an operating-system handle, and because I/O is buffered to improve performance, written data may not reach the disk until the file is flushed or closed. Reading without buffering would not remove the need to close, since the handle still has to be released.

How do I handle exceptions with Python generators using spaCy

I am using the spacy language.pipe method to process texts as a stream, and yield Doc objects in order. (https://spacy.io/api/language#pipe).
This method is faster than processing files one by one and takes a generator object as input.
If the system hits a "bad file" I want to ensure that I can identify it. However, I am not sure how to achieve this with Python generators. What is the best approach for ensuring I capture the error? I don't currently have a file to cause an error but will likely find one in production.
I am using spaCy version 2.1 and Python 3.6.3
import os
import spacy
nlp = spacy.load('en')
def genenerator():
    path = "C:/Temp/tmp/"  # place any text files here for testing
    try:
        for root, _, files in os.walk(path, topdown=False):
            for name in files:
                with open(os.path.join(root, name), 'r', encoding='utf-8', errors='ignore') as inputFileStream:
                    docText = inputFileStream.read()
                    yield (docText, name)
    except Exception as e:
        print('Error opening document. Doc name: {}'.format(os.path.join(root, name)), str(e))
def processfiles():
    try:
        for doc, file in nlp.pipe(genenerator(), as_tuples=True, batch_size=1000):
            print(file)
    except Exception as e:
        print('Error processing file: {}'.format(file), str(e))
if __name__ == '__main__':
    processfiles()
Edit - I have attempted to better explain my problem.
The specific thing I need to do is identify exactly which file caused a problem for spaCy. In particular, I want to know exactly which file fails during this statement:
for doc, file in nlp.pipe(genenerator(), as_tuples = True, batch_size=1000):
My assumption is that it could be possible to run into a file that causes spaCy to have an issue during the pipe statement (for example during the tagger or parser processing pipeline).
Originally I was processing the text into spaCy file by file, so if spaCy had a problem then I knew exactly what file caused it. Using a generator this seems to be harder.
I am confident that errors that occur in the generator method itself can be captured, especially taking on board the comments by John Rutledge.
Perhaps a better way to ask the question is: how do I handle exceptions when generators are passed to methods like this?
My understanding is that the pipe method will process the generator as a stream.
It looks like your main problem is that your try/except statement will currently halt execution on the first error it encounters. To continue yielding files when an error is encountered, you need to place your try/except further down in the for-loop, i.e. you can wrap the with open context manager.
Note also that a blanket try/except is considered an anti-pattern in Python, so typically you will want to catch and handle specific errors explicitly instead of using the general-purpose Exception. I included the more explicit IOError and OSError as examples (in Python 3, IOError is an alias of OSError).
Lastly, because you can catch the errors in the generator itself, the nlp.pipe function no longer needs the as_tuples param.
from pathlib import Path
import spacy
def grab_files(path):
    for path in Path(path).rglob('*'):
        if path.is_file():
            try:
                with open(str(path), 'r', encoding='utf-8', errors='ignore') as f:
                    yield f.read()
            except (OSError, IOError) as err:
                print(f'ERROR: {path}', err)
nlp = spacy.load('en')
for doc in nlp.pipe(grab_files('C:/Temp/tmp/'), batch_size=1000):
    print(doc)  # ... do something with spacy Doc here
*Edit - to answer followup question.
Note, you are still reading the contents of the text documents one at a time as you would have without a generator, however doing so via a generator returns an object that defers the execution until after you pass it into the nlp.pipe method. SpaCy then processes one batch of the text documents at a time via its internal util.minibatch function. That function ends in yield list(batch) which executes the code that opens/closes the files (1000 at a time in your case). So as regards any non-SpaCy related errors, i.e. errors associated with the opening/reading of the file, the code I posted should work as is.
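The deferred execution described here can be seen without spaCy at all. The sketch below is an illustration only: read_texts fakes the file reads, and minibatch is a simplified stand-in written for this example, not spaCy's actual util.minibatch.

```python
from itertools import islice


def read_texts(names, opened):
    """Yield fake file contents, recording each 'open' in `opened`."""
    for name in names:
        opened.append(name)  # stands in for open(...).read()
        yield "contents of " + name


def minibatch(items, size):
    """Group an iterable into lists of at most `size` items."""
    items = iter(items)
    while True:
        batch = list(islice(items, size))
        if not batch:
            return
        yield batch


opened = []
texts = read_texts(["a.txt", "b.txt", "c.txt"], opened)
opened_before = list(opened)  # creating the generator has not read anything

batches = minibatch(texts, 2)
first_batch = next(batches)
opened_after_first = list(opened)  # only the files in the first batch were read
```

Because the reads happen while a batch is being assembled, an exception raised inside the generator shows up in whatever code is consuming the batches, which is why handling it inside the generator keeps the file name close at hand.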
However, as it stands, both your os.walk and my Path(path).rglob indiscriminately pick up any file in the directory regardless of its file type. So, for example, if there were a .png file in your /tmp folder, then SpaCy would raise a TypeError during the tokenization process. If you want to capture those kinds of errors, your best bet is to anticipate and avoid them before sending the files to SpaCy, e.g. by amending your code with a whitelist that only allows certain file extensions (.rglob('*.txt')).
If you are working on a project that, for some reason or another, cannot afford to be interrupted by an error, no matter the cost, and you absolutely need to know at which stage of the pipeline the error occurred, then one approach might be to create a custom pipeline component for each default SpaCy pipeline component (Tagger, DependencyParser, etc.) you intend to use. You would then need to wrap those components in blanket error handling/logging logic. Having done that, you could process your files using your completely custom pipeline. But unless there is a gun pointed at your head, I would not recommend it. Much better would be to anticipate the errors you expect to occur and handle them inside your generator. Perhaps someone with better knowledge of SpaCy's internals will have a better suggestion, though.
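A sketch of the "anticipate and handle inside the generator" advice, combining the extension whitelist with the (text, name) pairs from the question. The function name grab_text_files and the temporary directory are assumptions for the demonstration, and no spaCy call is made here.

```python
import tempfile
from pathlib import Path


def grab_text_files(root, pattern="*.txt"):
    """Yield (text, filename) pairs for files under `root` matching `pattern`."""
    for file_path in sorted(Path(root).rglob(pattern)):
        if not file_path.is_file():
            continue
        try:
            text = file_path.read_text(encoding="utf-8", errors="ignore")
        except OSError as err:
            # Report the unreadable file and carry on with the next one.
            print("ERROR: {}: {}".format(file_path, err))
            continue
        yield text, file_path.name


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "image.png").write_bytes(b"\x89PNG\r\n")
    found = list(grab_text_files(tmp))
```

Pairs in this shape could be handed to nlp.pipe with as_tuples=True, as in the question, so that each Doc comes back together with the name of the file it was built from.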

Check that python function is defined for a particular input?

I have tried the following which doesn't work:
try:
    f = h5py.File(filepath)
except NameError:
    f = scipy.io.loadmat(filepath)
Basically, a user will pass a particular input (a filepath) to a function which is supposed to load the data in that file. But I don't expect the user to know whether the function is defined for that input.
I'm getting the following error:
OSError: Unable to create file (Unable to open file: name = '/users/cyrilrocke/documents/c_elegans/data/chemotaxis/n2laura_benzaldehyde_l_2013_03_17__15_39_19___2____features.mat', errno = 17, error message = 'file exists', flags = 15, o_flags = a02)
Note: basically, I want to be able to switch between the function h5py.File() and scipy.io.loadmat() depending on which one doesn't work. Given the inputs, one of these functions must work.
I guess you want something like this:
flag = True
try:
    f = h5py.File(filepath)
except:
    flag = False
try:
    if not flag:
        f = scipy.io.loadmat(filepath)
except:
    print('Error')
The problems in your code are:
You are catching the wrong error. Without specifying the error type you can handle all kinds of errors in a single way (you can also add multiple except clauses).
You are executing a potentially failing function inside the except clause. That way you won't be able to handle new errors, so I split them into two separate try/except blocks.
First: your code should know whether a function is defined! That's not something external, it is something that is entirely in your power.
If an import might fail, maybe do something like
try:
    import h5py
    HAVE_H5PY = True
except ImportError:
    HAVE_H5PY = False
and then use that to check.
Second, you could instead do something like
if 'h5py' in globals():
but that is ugly and unnecessary (you have to catch the ImportError anyway, if that is the problem you're trying to solve).
Third, your error message has nothing to do with any of this, it is returned by one of those two commands. It is trying to create a file that already exists.
error message = 'file exists'
Your program is getting EEXIST, and if you see a difference between trials with h5py and scipy.io, it's because they pass different flags to open(), and perhaps because the state differs if you clean up the files created between trials.
From the h5py.File docs:
Valid modes are:
...
r        Readonly, file must exist
w        Create file, truncate if exists
w- or x  Create file, fail if exists
a        Read/write if exists, create otherwise (default)
Since you're looking to read the file (only?) you should specify 'r' as the mode.
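For the "try one reader, then the other" goal in the question, a generic fallback helper is one possible shape. Everything named here (load_with_fallback, the fake loaders, the default exception tuple) is hypothetical; the stand-in loaders exist only so the sketch runs without h5py or scipy installed.

```python
def load_with_fallback(filepath, loaders, expected=(OSError, NotImplementedError)):
    """Try each loader in turn; return (loader_name, data) from the first that succeeds."""
    failures = []
    for name, loader in loaders:
        try:
            return name, loader(filepath)
        except expected as err:
            # Remember why this loader failed, then move on to the next one.
            failures.append("{}: {}".format(name, err))
    raise OSError("no loader could read {!r} ({})".format(filepath, "; ".join(failures)))


# Stand-in loaders (assumptions for the demo, not real library calls).
def fake_hdf5_loader(path):
    raise OSError("file signature not found")


def fake_mat_loader(path):
    return {"source": path}


used, data = load_with_fallback(
    "features.mat",
    [("hdf5", fake_hdf5_loader), ("mat", fake_mat_loader)],
)
```

In real use the pairs might be something like ("h5py", lambda p: h5py.File(p, "r")) and ("scipy", scipy.io.loadmat). Which exceptions each library raises for a file it cannot read is an assumption to check against the installed versions before settling on the expected tuple.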
If you're experiencing this problem only rarely, then perhaps it's a race generated by a TOCTOU-style error in h5py.
Indeed, such a TOCTOU design error exists in h5py.
https://github.com/h5py/h5py/blob/c699e741de64fda8ac49081e9ee1be43f8eae822/h5py/_hl/files.py#L99
# Open in append mode (read/write).
# If that fails, create a new file only if it won't clobber an
# existing one (ACC_EXCL)
try:
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
except IOError:
    fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
This is the reason for the O_EXCL flag used by posix-style OSs. h5f.ACC_EXCL likely does the same thing, but if it's only used upon failure of h5f.ACC_RDWR then its atomicity is diminished.
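Plain Python exposes the same exclusive-create idea through open mode "x" (available since Python 3.3), which creates the file and fails if it already exists in a single step. A minimal sketch, using a temporary directory and a made-up file name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.bin")

    # Mode "x" creates the file and refuses to open an existing one.
    with open(target, "xb") as f:
        f.write(b"first")

    try:
        with open(target, "xb") as f:
            f.write(b"second")
        second_attempt = "created"
    except FileExistsError:
        second_attempt = "refused"

    with open(target, "rb") as f:
        contents = f.read()
```

Because the existence check and the creation are one operation, there is no window between "check" and "use" for another process to slip into.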

Python: If file not found try looking for different filename

Problem:
I have code that looks for a file and opens it. By default it looks for a file that starts with ####### (each # being a number).
The problem is that sometimes the file name is ##-##### and other times #####.
If the file cannot be found, I would like to try looking for the other two ways the file name could be written.
An IOError exception happens when the file is not found. What I was thinking was to have an except statement that says:
except File2:
    Look for ##### in myfindFileFunction()
    if file is still not found run except File3
except File3:
    Look for ##-##### in myfindFileFunction()
except:
    print "File not found"
What I am not sure of is how to set up custom exceptions to work this way, and/or whether there is a more Pythonic way to do this altogether...
Would setting up a pattern of the three possible file names and iterating through each until the file is found work better?
Using try/except is indeed a very pythonic (and fast) way of doing things.
You have to weigh not only whether it's Pythonic, but also what impact that approach has on readability. Will you still understand the code quickly when you look at it again in 6 months? Will somebody else?
I usually make sure that slightly complex try/except clauses that handle this kind of thing are well commented. Aside from that... it's a perfectly reasonable way of doing it.
Also, to put your mind at ease regarding performance, a common concern when one is deciding between two approaches, take a look here: Python if vs try-except and you'll see that try/except constructs are fast in Python... really fast.
No custom exception is needed:
import errno
try:
    open('somefile')
except IOError as e:
    if e.errno == errno.ENOENT:
        open('someotherfilename')
    else:
        raise
(This is on *nix; I'm not sure whether you're using Windows.)
It's easy enough to define your own exceptions: just create a class derived from Exception. The documentation is clear.
However creating separate exceptions per file type, or any exception at all, doesn't seem necessary. You could do something like:
import errno
files = ('#######', '##-#####', '#####')
fh = None
for f in files:
    try:
        fh = open(f)
        break
    except IOError as e:
        if e.errno in (errno.ENOENT,):
            pass
        else:
            raise
if not fh:
    pass  # all three tries failed; handle that case here
The if around e.errno lets you decide which I/O errors mean "go on to the next file" and which are errors you want to know about. File does not exist (errno.ENOENT) means try the next file, but others like 'Too many files open' (errno.ENFILE) probably need a different response.
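On Python 3.3 and later the ENOENT case has its own exception class, FileNotFoundError, so the same loop can be written without inspecting errno. The sketch below assumes that Python version; the helper name open_first_existing and the three file names are hypothetical stand-ins for the numbering schemes in the question.

```python
import os
import tempfile


def open_first_existing(candidates):
    """Return an open file for the first candidate that exists."""
    for name in candidates:
        try:
            return open(name)
        except FileNotFoundError:
            continue  # try the next spelling; other OSErrors propagate
    raise FileNotFoundError("none of these files exist: {}".format(list(candidates)))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical names, one per numbering scheme; only the second exists.
    names = [os.path.join(tmp, n) for n in ("1234567.txt", "12-34567.txt", "12345.txt")]
    with open(names[1], "w") as f:
        f.write("found me")

    with open_first_existing(names) as fh:
        opened_name = os.path.basename(fh.name)
        text = fh.read()
```

Errors other than a missing file, such as a permissions problem, are deliberately left to propagate so they are not mistaken for "try the next name".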

Python Error-Checking Standard Practice

I have a question regarding error checking in Python. Let's say I have a function that takes a file path as an input:
def myFunction(filepath):
    infile = open(filepath)
    #etc etc...
One possible precondition would be that the file should exist.
There are a few possible ways to check for this precondition, and I'm just wondering what's the best way to do it.
i) Check with an if-statement:
if not os.path.exists(filepath):
    raise IOError('File does not exist: %s' % filepath)
This is the way that I would usually do it, though the same IOError would be raised by Python if the file does not exist, even if I don't raise it.
ii) Use assert to check for the precondition:
assert os.path.exists(filepath), 'File does not exist: %s' % filepath
Using asserts seems to be the "standard" way of checking for pre/postconditions, so I am tempted to use these. However, asserts are turned off when the -O flag is used during execution, which means that this check might be skipped, and that seems risky.
iii) Don't handle the precondition at all
This is because if filepath does not exist, an exception will be raised anyway, and the exception message is detailed enough for the user to know that the file does not exist.
I'm just wondering which of the above is the standard practice that I should use in my code.
If all you want to do is raise an exception, use option iii:
def myFunction(filepath):
    with open(filepath) as infile:
        pass
To handle exceptions in a special way, use a try...except block:
def myFunction(filepath):
    try:
        with open(filepath) as infile:
            pass
    except IOError:
        pass  # special handling code here
Under no circumstance is it preferable to check the existence of the file first (option i or ii) because in the time between when the check or assertion occurs and when python tries to open the file, it is possible that the file could be deleted, or altered (such as with a symlink), which can lead to bugs or a security hole.
Also, as of Python 2.6, the best practice when opening files is to use the with open(...) syntax. This guarantees that the file will be closed, even if an exception occurs inside the with-block.
In Python 2.5 you can use the with syntax if you preface your script with
from __future__ import with_statement
Definitely don't use an assert. Asserts should only fail if the code is wrong. External conditions (such as missing files) shouldn't be checked with asserts.
As others have pointed out, asserts can be turned off.
The formal semantics of assert are:
The condition may or may not be evaluated (so don't rely on side effects of the expression).
If the condition is true, execution continues.
It is undefined what happens if the condition is false.
More on this idea.
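That asserts can be switched off is easy to check. The sketch below runs the same two-line snippet in a child interpreter with and without the -O flag; the snippet text is made up for the demonstration.

```python
import subprocess
import sys

snippet = "assert False, 'precondition failed'\nprint('reached the end')"

# Normal run: the assert fires and the process exits with an error.
normal = subprocess.run([sys.executable, "-c", snippet],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Optimized run: -O strips assert statements, so the check never happens.
optimized = subprocess.run([sys.executable, "-O", "-c", snippet],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)

normal_failed = normal.returncode != 0
optimized_output = optimized.stdout.decode().strip()
```

Under -O the script carries on past the failed precondition, which is the reason not to use assert for conditions that depend on the outside world.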
The following extends from ~unutbu's example. If the file doesn't exist, or on any other type of IO error, the filename is also passed along in the error message:
path = 'blam'
try:
    with open(path) as f:
        print(f.read())
except IOError as exc:
    raise IOError("%s: %s" % (path, exc.strerror))
=> IOError: blam: No such file or directory
I think you should go with a mix of iii) and i). If you know for a fact that Python will raise the exception (i.e. case iii), then let Python do it. If there are other preconditions (e.g. demanded by your business logic), you should raise your own exceptions, maybe even ones derived from Exception.
Using asserts is too fragile imho, because they might be turned off.
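A sketch of that mix of iii) and i): let open() report a missing file on its own, and raise a custom exception only for a rule Python cannot know about. The class ConfigError, the function read_nonempty, and the "must not be empty" rule are all invented for illustration.

```python
import os
import tempfile


class ConfigError(Exception):
    """Raised when a file opens fine but violates a rule of the application."""


def read_nonempty(filepath):
    # Option iii: no existence check; open() raises by itself on a missing file.
    with open(filepath) as infile:
        text = infile.read()
    # Option i, applied only to a precondition specific to the application.
    if not text.strip():
        raise ConfigError("file is empty: %s" % filepath)
    return text


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.cfg")
    with open(empty, "w"):
        pass  # create an empty file

    try:
        read_nonempty(empty)
        empty_result = "ok"
    except ConfigError:
        empty_result = "ConfigError"

    try:
        read_nonempty(os.path.join(tmp, "missing.cfg"))
        missing_result = "ok"
    except OSError as exc:
        missing_result = type(exc).__name__
```

Callers can then tell the two situations apart by exception type instead of parsing messages.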
