I am using the spaCy Language.pipe method (https://spacy.io/api/language#pipe) to process texts as a stream and yield Doc objects in order.
This method is faster than processing files one by one and takes a generator object as input.
If the system hits a "bad file" I want to ensure that I can identify it. However, I am not sure how to achieve this with Python generators. What is the best approach for ensuring I capture the error? I don't currently have a file that causes an error, but I will likely encounter one in production.
I am using spaCy version 2.1 and Python 3.6.3.
import os
import spacy

nlp = spacy.load('en')

def generator():
    path = "C:/Temp/tmp/"  # place any text files here for testing
    try:
        for root, _, files in os.walk(path, topdown=False):
            for name in files:
                with open(os.path.join(root, name), 'r', encoding='utf-8', errors='ignore') as inputFileStream:
                    docText = inputFileStream.read()
                yield (docText, name)
    except Exception as e:
        print('Error opening document. Doc name: {}'.format(os.path.join(root, name)), str(e))

def processfiles():
    try:
        for doc, file in nlp.pipe(generator(), as_tuples=True, batch_size=1000):
            print(file)
    except Exception as e:
        print('Error processing file: {}'.format(file), str(e))

if __name__ == '__main__':
    processfiles()
Edit - I have attempted to better explain my problem.
The specific thing I need to be able to do is to identify exactly which file caused a problem for spaCy; in particular, I want to know exactly which file fails during this statement:

for doc, file in nlp.pipe(generator(), as_tuples=True, batch_size=1000):

My assumption is that it could be possible to run into a file that causes spaCy to have an issue during the pipe statement (for example during the tagger or parser processing pipeline).
Originally I was processing the text into spaCy file by file, so if spaCy had a problem then I knew exactly which file caused it. Using a generator this seems to be harder.
I am confident that errors that occur in the generator method itself can be captured, especially taking on board the comments by John Rutledge.
Perhaps a better way to ask the question is: how do I handle exceptions when generators are passed to methods like this?
My understanding is that the pipe method will process the generator as a stream.
It looks like your main problem is that your try/except statement will currently halt execution on the first error it encounters. To continue yielding files when an error is encountered, you need to place the try/except further down in the for-loop, i.e. you can wrap the with open context manager.
Note also that a blanket try/except is considered an anti-pattern in Python, so typically you will want to catch and handle errors explicitly instead of using the general-purpose Exception. I included the more explicit OSError and IOError as examples.
Lastly, because you can catch the errors in the generator itself, the nlp.pipe call no longer needs the as_tuples param.
from pathlib import Path
import spacy

def grab_files(path):
    for path in Path(path).rglob('*'):
        if path.is_file():
            try:
                with open(str(path), 'r', encoding='utf-8', errors='ignore') as f:
                    yield f.read()
            except (OSError, IOError) as err:
                print(f'ERROR: {path}', err)

nlp = spacy.load('en')

for doc in nlp.pipe(grab_files('C:/Temp/tmp/'), batch_size=1000):
    print(doc)  # ... do something with the spaCy Doc here
Edit - to answer the follow-up question.
Note, you are still reading the contents of the text documents one at a time, as you would have without a generator; however, doing so via a generator returns an object that defers execution until you pass it into the nlp.pipe method. spaCy then processes one batch of text documents at a time via its internal util.minibatch function. That function ends in yield list(batch), which executes the code that opens/closes the files (1000 at a time in your case). So as regards any non-spaCy-related errors, i.e. errors associated with opening/reading a file, the code I posted should work as is.
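A tiny demonstration of that deferred execution (the generator and its contents are illustrative): nothing in the generator body runs until something pulls items from it.

def gen():
    for i in range(3):
        print('opening file', i)
        yield 'text %d' % i

g = gen()        # nothing printed yet; no "file" has been opened
first = next(g)  # now 'opening file 0' is printed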
However, as it stands, both your os.walk and my Path(path).rglob are indiscriminately picking up any file in the directory regardless of its filetype. So for example, if there were a .png file in your /tmp folder, then spaCy would raise a TypeError during the tokenization process. If you want to capture those kinds of errors then your best bet is to anticipate and avoid them before sending them to spaCy, e.g. by amending your code with a whitelist that only allows certain file extensions (.rglob('*.txt')), as in the sketch below.
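A sketch of that whitelist idea, as a variant of the grab_files generator above; the allowed extension set is an assumption you would adapt to your data:

from pathlib import Path

def grab_text_files(path, allowed=('.txt',)):
    # Only yield files whose extension is on the whitelist, so spaCy
    # never sees binary formats like .png.
    for p in Path(path).rglob('*'):
        if p.is_file() and p.suffix.lower() in allowed:
            with open(str(p), 'r', encoding='utf-8', errors='ignore') as f:
                yield f.read()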
If you are working on a project that for some reason or another cannot afford to be interrupted by an error, no matter the cost, and supposing you absolutely needed to know at which stage of the pipeline the error occurred, then one approach might be to create a custom pipeline component for each default spaCy pipeline component (Tagger, DependencyParser, etc.) you intend to use. You would then need to wrap said components in blanket error-handling/logging logic, roughly as in the sketch below. Having done that, you could process your files using your completely custom pipeline. But unless there is a gun pointed at your head, I would not recommend it. Much better would be to anticipate the errors you expect to occur and handle them inside your generator. Perhaps someone with better knowledge of spaCy's internals will have a better suggestion though.
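For illustration only, a minimal sketch of that wrapping idea using spaCy 2.x's replace_pipe; make_safe is a hypothetical helper, and returning the Doc unmodified on failure is my assumption about what "skipping" a component should mean:

import spacy

nlp = spacy.load('en')

def make_safe(component, name):
    # Wrap a pipeline component so exceptions are logged instead of
    # propagating; the doc is passed through unmodified on failure.
    def safe_component(doc):
        try:
            return component(doc)
        except Exception as e:
            print('Error in {} for doc starting {!r}: {}'.format(name, doc.text[:50], e))
            return doc
    return safe_component

for name, component in list(nlp.pipeline):
    nlp.replace_pipe(name, make_safe(component, name))

Note that a Doc skipped by the tagger may then break downstream components that expect tags, so this really is a last resort.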
Related
Imagine one writes a unit test for handle for the case when path does not exist:

def handle(path):
    try:
        with open(path) as f:
            pass
    except FileNotFoundError:
        raise FileNotFoundError(path)
I would write something like the following for such a test:

import pytest

def test_handle_on_non_existent_path():
    x = "abc"  # some unbelievable string
    with pytest.raises(FileNotFoundError):
        handle(x)
My question is: what is a better way to generate a non-existent path for a unit test?
My ideas are:
force-delete a temporary file
generate a random string, like a uuid?
"abc" is fairly concise, but in principle does not guarantee that the path does not exist.
Update: in this question x is "no_exist.txt".
With respect to unit testing, it seems your intent is to test the behaviour of your code for the case that open(path) throws a FileNotFoundError. Your approach is to have the code actually perform the open call, but with a non-existent path name. This has some disadvantages. As you already noticed, the dependency on the real file system raises the question of how to create a value for path that reliably does not exist as a file on the file system. But there is another point: you cannot be sure that there isn't some other problem with the file system, for example a permission-related problem, which could cause a different exception to be raised (OSError).
Put together, performing the call to open itself means you are not in full control of what happens. Therefore, a better approach for this unit test case can be to mock open and make the mock raise the FileNotFoundError.
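A minimal sketch of that approach with unittest.mock, assuming handle calls the built-in open directly (the patch target would differ if open were imported another way):

from unittest import mock

import pytest

def handle(path):
    # the handle function from the question, simplified
    with open(path) as f:
        pass

def test_handle_on_non_existent_path():
    # Patch the built-in open so it raises without touching the file system.
    with mock.patch('builtins.open', side_effect=FileNotFoundError('fake path')):
        with pytest.raises(FileNotFoundError):
            handle('any-path-at-all')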
A parser I created reads recorded chess games from a file. The API is used like this:
import chess.pgn
pgn_file = open("games.pgn")
first_game = chess.pgn.read_game(pgn_file)
second_game = chess.pgn.read_game(pgn_file)
# ...
Sometimes illegal moves (or other problems) are encountered. What is a good Pythonic way to handle them?
Raising exceptions as soon as the error is encountered. However, this makes every problem fatal, in that execution stops. Often, there is still useful data that has been parsed and could be returned. Also, you cannot simply continue parsing the next data set, because we are still in the middle of some half-read data.
Accumulating exceptions and raising them at the end of the game. This makes the error fatal again, but at least you can catch it and continue parsing the next game.
Introduce an optional argument like this:

game = chess.pgn.read_game(pgn_file, parser_info)
if parser_info.error:
    # This appears to be quite verbose.
    # Now you can at least make the best of the successfully parsed parts.
    # ...
Are some of these or other methods used in the wild?
The most Pythonic way is the logging module. It has been mentioned in the comments, but unfortunately without being stressed hard enough. There are many reasons it's preferable to warnings:
The warnings module is intended to report warnings about potential code issues, not bad user data.
The first reason is actually enough. :-)
The logging module provides adjustable message severity: not only warnings, but anything from debug messages to critical errors can be reported.
You can fully control the output of the logging module. Messages can be filtered by their source, contents and severity, formatted in any way you wish, sent to different output targets (console, pipes, files, memory, etc.)...
The logging module separates actual error/warning/message reporting from output: your code can generate messages of the appropriate type and doesn't have to bother with how they're presented to the end user.
The logging module is the de facto standard for Python code. Everyone everywhere is using it. So if your code is using it, combining it with 3rd-party code (which is likely using logging too) will be a breeze. Well, maybe something stronger than a breeze, but definitely not a category 5 hurricane. :-)
A basic use case for logging module would look like:
import logging
logger = logging.getLogger(__name__) # module-level logger
# (tons of code)
logger.warning('illegal move: %s in file %s', move, file_name)
# (more tons of code)
This will print messages like:
WARNING:chess_parser:illegal move: a2-b7 in file parties.pgn
(assuming your module is named chess_parser.py)
The most important thing is that you don't need to do anything else in your parser module. You declare that you're using the logging system, you're using a logger with a specific name (the same as your parser module's name in this example), and you're sending warning-level messages to it. Your module doesn't have to know how these messages are processed, formatted and reported to the user, or whether they're reported at all. For example, you can configure the logging module (usually at the very start of your program) to use a different format and dump it to a file:

logging.basicConfig(filename='parser.log', format='%(name)s [%(levelname)s] %(message)s')
And suddenly, without any changes to your module code, your warning messages are saved to a file with a different format instead of being printed to screen:
chess_parser [WARNING] illegal move: a2-b7 in file parties.pgn
Or you can suppress warnings if you wish:
logging.basicConfig(level=logging.ERROR)
And your module's warnings will be ignored completely, while any ERROR or higher-level messages from your module will still be processed.
Actually, those are fatal errors -- at least, as far as being able to reproduce a correct game; on the other hand, maybe the player actually did make the illegal move and nobody noticed at the time (which would make it a warning, not a fatal error).
Given the possibility of both fatal errors (file is corrupted) and warnings (an illegal move was made, but subsequent moves show consistency with that move (in other words, user error and nobody caught it at the time)) I recommend a combination of the first and second options:
raise an exception when continued parsing isn't an option
collect any errors/warnings that don't preclude further parsing until the end
If you don't encounter a fatal error then you can return the game, plus any warnings/non-fatal errors, at the end:
return game, warnings, errors
But what if you do hit a fatal error?
No problem: create a custom exception to which you can attach the usable portion of the game and any other warnings/non-fatal errors to:
raise ParsingError(
    'error explanation here',
    game=game,
    warnings=warnings,
    errors=errors,
)
then when you catch the error you can access the recoverable portion of the game, along with the warnings and errors.
The custom error might be:
class ParsingError(Exception):
    def __init__(self, msg, game, warnings, errors):
        super().__init__(msg)
        self.game = game
        self.warnings = warnings
        self.errors = errors
and in use:
try:
    first_game, warnings, errors = chess.pgn.read_game(pgn_file)
except chess.pgn.ParsingError as err:
    first_game = err.game
    warnings = err.warnings
    errors = err.errors
    # whatever else you want to do to handle the exception
This is similar to how the subprocess module handles errors.
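For comparison, a small sketch of that subprocess behaviour (a POSIX command is used for illustration): CalledProcessError carries the return code and any captured output on the exception object, much like ParsingError carries the partial game above.

import subprocess

try:
    out = subprocess.check_output(['ls', 'no-such-directory'])
except subprocess.CalledProcessError as err:
    # The exception object carries the partial results.
    print(err.returncode, err.output)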
For the ability to retrieve and parse subsequent games after a game fatal error I would suggest a change in your API:
have a game iterator that simply returns the raw data for each game (it only has to know how to tell when one game ends and the next begins)
have the parser take that raw game data and parse it (so it's no longer in charge of where in the file you happen to be)
This way if you have a five-game file and game two dies, you can still attempt to parse games 3, 4, and 5.
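A sketch of that split; iter_raw_games and parse_game are hypothetical names, and the assumption that every game starts with an [Event tag is a simplification of real PGN:

def iter_raw_games(lines):
    # Yield the raw text of one game at a time; this only needs to know
    # where one game ends and the next begins.
    buf = []
    for line in lines:
        if line.startswith('[Event ') and buf:
            yield ''.join(buf)
            buf = []
        buf.append(line)
    if buf:
        yield ''.join(buf)

with open('games.pgn') as pgn_file:
    for raw in iter_raw_games(pgn_file):
        try:
            game = parse_game(raw)  # hypothetical parser
        except ParsingError as err:
            game = err.game  # keep whatever was recoverable, move on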
I offered the bounty because I'd like to know if this is really the best way to do it. However, I'm also writing a parser and so I need this functionality, and this is what I've come up with.
The warnings module is exactly what you want.
What turned me away from it at first was that every example warning used in the docs looks like these:
Traceback (most recent call last):
File "warnings_warn_raise.py", line 15, in <module>
warnings.warn('This is a warning message')
UserWarning: This is a warning message
...which is undesirable because I don't want it to be a UserWarning; I want my own custom warning name.
Here's the solution to that:
import warnings

class AmbiguousStatementWarning(Warning):
    pass

def x():
    warnings.warn("unable to parse statement syntax",
                  AmbiguousStatementWarning, stacklevel=3)
    print("after warning")

def x_caller():
    x()

x_caller()
which gives:
$ python3 warntest.py
warntest.py:12: AmbiguousStatementWarning: unable to parse statement syntax
x_caller()
after warning
I'm not sure if the solution is pythonic or not, but I use it rather often with slight modifications: a parser does its job within a generator and yields results and a status code. The receiving code decides what to do with failed items:
def process_items(items):
    for item in items:
        try:
            # process item
            yield processed_item, None
        except Exception as err:
            yield None, (SOME_ERROR_CODE, str(err), item)

for processed, err in process_items(items):
    if err:
        # process and log err, collect failed items, etc.
        continue
    # further process processed
A more general approach is to use design patterns. A simplified version of Observer (where you register callbacks for specific errors) or a kind of Visitor (where the visitor has methods for processing specific errors; see SAX parsers for insights) might be a clear and well-understood solution, as in the sketch below.
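A minimal sketch of the Observer idea (all names are illustrative): the caller registers a callback per error type, and the parser reports failures through it while continuing with the next item.

class Parser:
    def __init__(self):
        self._handlers = {}

    def on_error(self, exc_type, callback):
        # Register a callback for a specific error type.
        self._handlers[exc_type] = callback

    def _report(self, exc, item):
        handler = self._handlers.get(type(exc))
        if handler:
            handler(exc, item)

    def parse(self, items):
        for item in items:
            try:
                yield item.upper()  # stand-in for real parsing work
            except AttributeError as exc:
                self._report(exc, item)

parser = Parser()
parser.on_error(AttributeError, lambda exc, item: print('failed item:', item))
print(list(parser.parse(['ok', 42, 'fine'])))  # ['OK', 'FINE']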
Without libraries, it is difficult to do this cleanly, but still possible.
There are different methods of handling this, depending on the situation.
Method 1:
Put all the contents of the while loop inside the following:

while 1:
    try:
        ...  # codecodecode
    except Exception as detail:
        print(detail)
Method 2:
Same as Method 1, except with multiple try/except blocks, so it doesn't skip too much code and you know the exact location of the error.
Sorry, in a rush, hope this helps!
Problem:
I have code that looks for a file and opens it. By default it looks for a file name that starts with ####### (each # being a number).
The problem is that sometimes the file name is ##-##### and other times #####.
I would like a way, if the file cannot be found, to try looking for the other two ways the file name could be written.
An IOError exception happens when the file is not found. What I was thinking was to have an except statement that says:

except File2:
    look for ##### in myfindFileFunction()
    if file is still not found run except File3
except File3:
    look for ##-#### in myfindFileFunction()
except:
    print "File not found"

What I am not sure of is how to set up a custom exception to work this way, and/or whether there is a more pythonic way to do this altogether...
Would setting up a pattern for the three possible file names and iterating through each until the file is found work better?
Using try/except is indeed a very pythonic (and fast) way of doing things.
You have to weigh not only whether it's pythonic, but what impact that approach has in terms of readability. Will you still understand the code quickly when you look at it again in 6 months? Will somebody else?
I usually make sure that slightly complex try/except clauses to handle this kind of thing are well commented. Aside from that... it's a perfectly reasonable way of doing it.
Also, to put your mind at ease regarding performance, a common concern when one is deciding between two approaches, take a look here: Python if vs try-except and you'll see that try/except constructs are fast in Python... really fast.
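If you want to convince yourself on your own machine, a rough sketch with timeit (absolute numbers vary, and try/except only gets expensive when the exception path is actually taken):

import timeit

setup = "d = {'a': 1}"

# Look before you leap:
t_if = timeit.timeit("v = d['a'] if 'a' in d else None", setup=setup)

# Easier to ask forgiveness; the exception is never raised here:
t_try = timeit.timeit(
    "try:\n    v = d['a']\nexcept KeyError:\n    v = None", setup=setup)

print(t_if, t_try)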
No custom exception is needed:

import errno

try:
    open('somefile')
except IOError as e:
    if e.errno == errno.ENOENT:
        open('someotherfilename')
    else:
        raise

(this is on *nix; I'm not sure if you're using Windows)
It's easy enough to define your own exceptions -- just create a class derived from Exception. The documentation is clear.
However, creating separate exceptions per file type, or any exception at all, doesn't seem necessary. You could do something like:

import errno

files = ('#######', '##-#####', '#####')
fh = None
for f in files:
    try:
        fh = open(f)
        break
    except IOError as e:
        if e.errno in (errno.ENOENT,):
            pass
        else:
            raise
if not fh:
    pass  # all three tries failed

The use of if around e.errno lets you decide which IO errors mean go on to the next file and which are errors you want to know about. "File does not exist" (errno.ENOENT) means try the next file, but others like "too many open files" (errno.ENFILE) probably need a different response.
I have a question regarding error checking in Python. Let's say I have a function that takes a file path as an input:
def myFunction(filepath):
    infile = open(filepath)
    # etc etc...
One possible precondition would be that the file should exist.
There are a few possible ways to check for this precondition, and I'm just wondering what's the best way to do it.
i) Check with an if-statement:
if not os.path.exists(filepath):
    raise IOError('File does not exist: %s' % filepath)

This is the way that I would usually do it, though a similar IOError would be raised by Python anyway if the file does not exist, even if I don't raise it.
ii) Use assert to check for the precondition:
assert os.path.exists(filepath), 'File does not exist: %s' % filepath
Using asserts seems to be the "standard" way of checking for pre/postconditions, so I am tempted to use them. However, it is possible that these asserts are turned off when the -O flag is used during execution, which means that this check might potentially be disabled, and that seems risky.
iii) Don't handle the precondition at all
This is because if filepath does not exist, there will be an exception generated anyway and the exception message is detailed enough for user to know that the file does not exist
I'm just wondering which of the above is the standard practice that I should use for my code.
If all you want to do is raise an exception, use option iii:
def myFunction(filepath):
    with open(filepath) as infile:
        pass
To handle exceptions in a special way, use a try...except block:
def myFunction(filepath):
    try:
        with open(filepath) as infile:
            pass
    except IOError:
        pass  # special handling code here
Under no circumstance is it preferable to check the existence of the file first (option i or ii) because in the time between when the check or assertion occurs and when python tries to open the file, it is possible that the file could be deleted, or altered (such as with a symlink), which can lead to bugs or a security hole.
Also, as of Python 2.6, the best practice when opening files is to use the with open(...) syntax. This guarantees that the file will be closed, even if an exception occurs inside the with-block.
In Python 2.5 you can use the with syntax if you preface your script with
from __future__ import with_statement
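A tiny self-contained demonstration of that guarantee: the file is closed even though the with-block raises.

with open('demo.txt', 'w') as f:
    f.write('hello')

try:
    with open('demo.txt') as f:
        raise ValueError('boom')
except ValueError:
    print(f.closed)  # True: the with statement closed the file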
Definitely don't use an assert. Asserts should only fail if the code is wrong. External conditions (such as missing files) shouldn't be checked with asserts.
As others have pointed out, asserts can be turned off.
The formal semantics of assert are:
The condition may or may not be evaluated (so don't rely on side effects of the expression).
If the condition is true, execution continues.
It is undefined what happens if the condition is false.
More on this idea.
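A quick way to see this from the command line; under -O the assert statement is compiled away entirely:

$ python -c "assert False, 'boom'"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: boom
$ python -O -c "assert False, 'boom'"
$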
The following extends from ~unutbu's example. If the file doesn't exist, or on any other type of IO error, the filename is also passed along in the error message:
path = 'blam'
try:
    with open(path) as f:
        print(f.read())
except IOError as exc:
    raise IOError("%s: %s" % (path, exc.strerror))
=> IOError: blam: No such file or directory
I think you should go with a mix of iii) and i). If you know for a fact that Python will throw the exception (i.e. case iii), then let Python do it. If there are some other preconditions (e.g. ones demanded by your business logic), you should throw your own exceptions, maybe even derived from Exception.
Using asserts is too fragile imho, because they might be turned off.