Use with in __iter__ - python

I have to open a text file, read it line by line, and return only the lines which contain numbers.
Is it a good idea to use a with statement in __iter__? Like:
def __iter__(self):
    with open(file_name) as fp:
        for i in fp:
            if is_number(i):
                yield i
Or is this way better:
def __enter__(self):
    self._fp = open(self._file, 'r')
    return self

def __exit__(self, exc_type, exc_val, exc_tb):
    self._fp.close()

def __iter__(self) -> int:
    for tracker_id in self._fp:
        if re.search('\d', tracker_id):
            yield int(tracker_id)

You need a generator, rather than a context manager. To create one you could try something like this:
import re

def filter_lines(filename: str, pattern: str):
    p = re.compile(pattern)
    with open(filename) as f:
        for line in f:
            if p.search(line):
                yield line

if __name__ == "__main__":
    for line in filter_lines('myfile.txt', r'\d'):
        print(line)
Remember to compile your regex patterns if you're going to use them more than once.

I think the second form of the code is better.
The first version relies on the iterator returned by __iter__ existing only as long as the iteration is going on. If something breaks out of the iteration without deallocating the iterator, the file could be left open indefinitely.
Used like this, it is mostly safe: if an exception happens in the loop body, the object and its iterator will be garbage collected, since the only reference to the iterator is the one held by the for loop itself (though it might not be collected promptly on interpreters other than CPython, or if garbage collection is turned off):
for x in Whatever():  # assuming your methods are in a class named Whatever
    ...  # do stuff
This alternative use is probably not safe, as the iterator will exist in a stack frame that might live on for quite some time while an exception is being handled:
it = iter(Whatever())
for x in it:
    ...  # do stuff
The second form of your code makes it explicit that the calling code is responsible for ensuring the resources get cleaned up properly. You'd call it with something like this, and can be confident that the file will be closed if an exception gets raised:
with Whatever() as w:
    for x in w:
        ...  # do stuff
The main downside of the second version of the code is that you can't iterate over the same object more than once at the same time, since both iterations would share the same file object. If somebody wants to iterate twice over the same file, they'll need to create separate instances of the class.
The one-use nature of the object might be more natural if it was an iterator itself, rather than just iterable (this is how file objects work, for instance):
class Whatever:
    def __enter__(self):
        self._fp = open(self._file, 'r')
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._fp.close()

    def __iter__(self):
        return self

    def __next__(self):
        tracker_id = next(self._fp)
        while re.search(r'\d', tracker_id) is None:
            tracker_id = next(self._fp)
        return int(tracker_id)
Note that we are deliberately not attempting to catch any StopIteration exception that might be raised by calling next on our file, as that will be our signal that we're done too.

In the first case, the file is opened when iteration is requested. That may incur extra I/O if multiple iterations are done. In the second case, the file is always opened when the object is used in a with statement, even if no iteration is done.
There are tradeoffs - one approach might be more efficient depending on how the object is used. If you need to support diverse usage patterns, you may want to combine the approaches. Lazily open the file the first time iteration is requested and then close it in __exit__. If you don't need that flexibility then choose the option that best fits how the object is likely to be used.
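A rough sketch of that combined approach could look like this (the class and file names are illustrative, reusing the digit-filtering idea from the question; it opens lazily on the first iteration and closes in __exit__):
import re

class NumberLines:
    """Illustrative sketch: open lazily on first iteration, close in __exit__."""

    def __init__(self, file_name):
        self._file_name = file_name
        self._fp = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._fp is not None:
            self._fp.close()
            self._fp = None

    def __iter__(self):
        if self._fp is None:  # open only when iteration is actually requested
            self._fp = open(self._file_name)
        for line in self._fp:
            if re.search(r'\d', line):
                yield line

with NumberLines('myfile.txt') as lines:
    for line in lines:
        print(line)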


Is there a Python standard library function to create a generator from repeatedly calling a function?

I have a method I want to call repeatedly to iterate over, which will raise StopIteration when it's done (in this case, on an instance of pyarrow.csv.CSVStreamingReader looping over a large file). I can use it in a for loop like this:
def batch_generator():
    while True:
        try:
            yield reader.read_next_batch()
        except StopIteration:
            return

for batch in batch_generator():
    writer.write_table(batch)
It can be done in a generic way with a user-defined function:
def make_generator(f):
    def gen():
        while True:
            try:
                yield f()
            except StopIteration:
                return
    return gen()

for batch in make_generator(reader.read_next_batch):
    writer.write_table(batch)
...but I wondered if something like this was possible with standard library functions or with some obscure syntax?
I would assume that the normal iter() function with its second argument should do what you want. As in:
for batch in iter(reader.read_next_batch, None):
    ...
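For reference, the classic illustration of two-argument iter() from the built-in function docs reads a file in fixed-size blocks until the empty-bytes sentinel comes back; handle() and the filename below are only placeholders:
with open('mydata.bin', 'rb') as f:
    # iter(callable, sentinel) calls f.read(1024) repeatedly and stops
    # as soon as it returns the sentinel value b'' (end of file)
    for chunk in iter(lambda: f.read(1024), b''):
        handle(chunk)  # handle() is a placeholder for your own processing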
The answer to your underlying question of how to iterate a CSVStreamingReader is that the reader itself is iterable and does just the thing you want:
reader = pyarrow.csv.open_csv(...)
for batch in reader:
    ...
In general it is really rare for Python libraries to return "iterable" things that are not Python-iterable. That is always a sensible first thing to try.

Python define variable as "load file at first use"

Python beginner here. I currently have some code that looks like this:
a = some_file_reading_function('filea')
b = some_file_reading_function('fileb')
# ...
if some_condition(a):
    do_complicated_stuff(b)
else:
    ...  # nothing that involves b
What bothers me is that loading 'fileb' may not be necessary, and it carries a performance penalty. Ideally, I would load it only if b is actually required later on. On the other hand, b might be used multiple times, so if it is used at all, the file should be loaded only once. I do not know how to achieve that.
In the above pseudocode, one could trivially move the loading of 'fileb' inside the conditional branch, but in reality there are more than two files and the conditional branching is quite complex. Also, the code is still under heavy development and the conditional branching may change.
I looked a bit at either iterators or defining a class, but (probably due to my inexperience) could not make either work. The key problem I met was to load the file zero times if unneeded, and only once if needed. I found nothing on searching, because "how to load a file by chunks" pollutes the results for "file lazy loading" and similar queries.
If it matters: Python 3.5 on Win7, and some_file_reading_function returns 1-D numpy.ndarray objects.
class LazyFile():
    def __init__(self, file):
        self.file = file
        self._data = None

    @property  # so you can use .data instead of .data()
    def data(self):
        if self._data is None:  # if not loaded
            self._data = some_file_reading_function(self.file)  # load it
        return self._data

a = LazyFile('filea')
b = LazyFile('fileb')

if some_condition(a.data):
    do_complicated_stuff(b.data)
else:
    ...  # other stuff
Actually, just found a workaround with classes. Try/except inspired by How to know if an object has an attribute in Python. A bit ugly, but does the job:
class Filecontents:
    def __init__(self, filepath):
        self.fp = filepath

    def eval(self):
        try:
            self.val
        except AttributeError:
            self.val = some_file_reading_function(self.fp)
        finally:
            return self.val

def some_file_reading_function(fp):
    # For demonstration purposes: say that you are loading something
    print('Loading ' + fp)
    # Return value
    return 0

a = Filecontents('somefile')
print('Not used yet')
print('Use #1: value={}'.format(a.eval()))
print('Use #2: value={}'.format(a.eval()))
Not sure that is the "best" (prettiest, most Pythonic) solution though.
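As a further minimal sketch (not a replacement for the class-based answers above), the standard library's functools.lru_cache can memoize a plain loader function, giving the same load-at-most-once behaviour without a wrapper class; some_file_reading_function, some_condition and do_complicated_stuff are the question's placeholders:
from functools import lru_cache

@lru_cache(maxsize=None)  # first call per path loads the file; later calls reuse the result
def load_file(filepath):
    return some_file_reading_function(filepath)

if some_condition(load_file('filea')):
    do_complicated_stuff(load_file('fileb'))  # 'fileb' is read here, exactly once
else:
    ...  # 'fileb' is never read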

Does writing a file to disk with Python open().write() ensure the data is available to other processes?

One Python process writes status updates to a file for other processes to read. In some circumstances, the status updates happen repeatedly and quickly in a loop. The easiest and fastest approach is to use open().write() in one line:
open(statusfile,'w').write(status)
An alternate approach forces the data all the way to disk, but at a significant performance penalty:
f = open(self.statusfile, 'w')
f.write(status)
f.flush()       # push Python's internal buffer to the OS before fsync
os.fsync(f)
f.close()
I'm not trying to protect against an OS crash. So, does the first approach push the data into the OS buffers, so that other processes read the newest status data when they open the file? Or do I need to use os.fsync()?
No, the first approach does not guarantee that the data is written out: nothing guarantees that the file will be flushed and closed as soon as the temporary file object stops being referenced after the write. That is likely to happen promptly in CPython, but not necessarily in other Python interpreters; it's an implementation detail of the garbage collector.
You should really use the second approach, except that os.fsync is not needed; just close the file and the data should be available to other processes.
Or, even better (Python >=2.5):
with open(self.statusfile, 'w') as f:
    f.write(status)
The with version is exception-safe: the file is closed even if write fails.
Since Python 2.2 it's been possible to subclass the language's built-in types (note that the built-in file type only exists in Python 2). This means you could derive your own file type whose write() method returns self instead of None like the built-in version does. Doing so would make it possible to also chain a close() call onto the end of your one-liner.
class ChainableFile(file):
    def __init__(self, *args, **kwargs):
        file.__init__(self, *args, **kwargs)

    def write(self, str):
        file.write(self, str)
        return self

def OpenFile(filename, *args, **kwargs):
    return ChainableFile(filename, *args, **kwargs)

statusfile = 'statusfile.txt'
status = 'OK\n'
OpenFile(statusfile, 'w').write(status).close()

equivalent of Python's "with" in Ruby

In Python, the with statement is used to make sure that clean-up code always gets called, regardless of exceptions being thrown or function calls returning. For example:
with open("temp.txt", "w") as f:
f.write("hi")
raise ValueError("spitespite")
Here, the file is closed, even though an exception was raised. A better explanation is here.
Is there an equivalent for this construct in Ruby? Or can you code one up, since Ruby has continuations?
Ruby has syntactically lightweight support for literal anonymous procedures (called blocks in Ruby). Therefore, it doesn't need a new language feature for this.
So, what you normally do, is to write a method which takes a block of code, allocates the resource, executes the block of code in the context of that resource and then closes the resource.
Something like this:
def with(klass, *args)
  yield r = klass.open(*args)
ensure
  r.close
end
You could use it like this:
with File, 'temp.txt', 'w' do |f|
  f.write 'hi'
  raise 'spitespite'
end
However, this is a very procedural way to do this. Ruby is an object-oriented language, which means that the responsibility of properly executing a block of code in the context of a File should belong to the File class:
File.open 'temp.txt', 'w' do |f|
  f.write 'hi'
  raise 'spitespite'
end
This could be implemented something like this:
def File.open(*args)
  f = new(*args)
  return f unless block_given?
  yield f
ensure
  f.close if block_given?
end
This is a general pattern that is implemented by lots of classes in the Ruby core library, standard libraries and third-party libraries.
A closer correspondence to the generic Python context manager protocol would be:
def with(ctx)
  yield ctx.setup
ensure
  ctx.teardown
end

class File
  def setup; self end
  alias_method :teardown, :close
end

with File.open('temp.txt', 'w') do |f|
  f.write 'hi'
  raise 'spitespite'
end
Note that this is virtually indistinguishable from the Python example, but it didn't require the addition of new syntax to the language.
The equivalent in Ruby would be to pass a block to the File.open method.
File.open(...) do |file|
  # do stuff with file
end  # file is closed
This is the idiom that Ruby uses and one that you should get comfortable with.
You could use Block Arguments to do this in Ruby:
class Object
  def with(obj)
    obj.__enter__
    begin
      yield obj
    ensure
      obj.__exit__
    end
  end
end
Now, you could add __enter__ and __exit__ methods to another class and use it like this:
with GetSomeObject("somefile.text") do |foo|
do_something_with(foo)
end
I'll just add some more explanation to the other answers; credit should go to them.
Indeed, in Ruby, clean-up code goes in an ensure clause, as others have said; but wrapping things in blocks is ubiquitous in Ruby, and this is how it is done most efficiently and most in the spirit of Ruby. When translating, don't translate word-for-word or you will get some very strange sentences. Similarly, don't expect everything from Python to have a one-to-one correspondence in Ruby.
From the link you posted:
class controlled_execution:
    def __enter__(self):
        set things up
        return thing

    def __exit__(self, type, value, traceback):
        tear things down

with controlled_execution() as thing:
    some code
Ruby way, something like this (man, I'm probably doing this all wrong :D ):
def controlled_executor
  begin
    do_setup
    yield
  ensure
    do_cleanup
  end
end

controlled_executor do
  some_code
end
Obviously, you can add arguments to both controlled executor (to be called in a usual fashion), and to yield (in which case you need to add arguments to the block as well). Thus, to implement what you quoted above,
class File
  def self.my_open(file, mode = "r")
    handle = open(file, mode)
    begin
      yield handle
    ensure
      handle.close
    end
  end
end

File.my_open("temp.txt", "w") do |f|
  f.write("hi")
  raise Exception.new("spitespite")
end
It's also possible to write a file in a single call in Ruby, like so:
File.write("temp.txt", "hi")
raise "spitespite"
Writing code like this means that it is impossible to accidentally leave the file open: File.write opens, writes, and closes the file for you.
You could always use a try..catch..finally block, where the finally section contains code to clean up.
Edit: sorry, misspoke: you'd want begin..rescue..ensure.
I believe you are looking for ensure.

Is this an acceptable pythonic idiom?

I have a class that assists in importing a special type of file, and a 'factory' class that allows me to do these in batch. The factory class uses a generator so the client can iterate through the importers.
My question is, did I use the iterator correctly? Is this an acceptable idiom? I've just started using Python.
class FileParser:
    """ uses an open filehandle to do stuff """

class BatchImporter:
    def __init__(self, files):
        self.files = files

    def parsers(self):
        for file in self.files:
            try:
                fh = open(file, "rb")
                parser = FileParser(fh)
                yield parser
            finally:
                fh.close()

    def verifyfiles(
    def cleanup(

---

importer = BatchImporter(filelist)
for p in importer.parsers():
    p.method1()
    ...
You could make one thing a little simpler: Instead of try...finally, use a with block:
with open(file, "rb") as fh:
yield FileParser(fh)
This will close the file for you automatically as soon as the with block is left.
It's absolutely fine to have a method that's a generator, as you do. I would recommend making all your classes new-style (if you're on Python 2, either set __metaclass__ = type at the start of your module, or add (object) to all your base-less class statements), because legacy classes are "evil";-); and, for clarity and conciseness, I would also recommend coding the generator differently...:
def parsers(self):
    for afile in self.files:
        with open(afile, "rb") as fh:
            yield FileParser(fh)
but neither of these bits of advice condemns in any way the use of generator methods!-)
Note the use of afile in lieu of file: the latter is a built-in identifier, and as a general rule it's better to get used to not "hide" built-in identifiers with your own (it doesn't bite you here, but it will in many nasty ways in the future unless you get into the right habit!-).
The design is fine if you ask me, though using finally the way you do isn't exactly idiomatic. Use except and maybe re-raise the exception (using the raise keyword alone, otherwise you mess up the stack trace), and for bonus points, don't use a bare except: but except Exception: (otherwise you would also catch SystemExit and KeyboardInterrupt).
Or simply use the with-statement as shown by Tim Pietzcker.
In general, it isn't safe to close the file after you yield a parser object that will try to read it. Consider this code:
parsers = list(importer.parsers())
for p in parsers:
    # the file object that p holds will already be closed!
    p.method1()
If you're not writing a long-running daemon process, most of the time you don't need to worry about closing files -- they will all get closed when your program exits, or when the file objects are garbage-collected. (And if you use CPython, that will happen as soon as all references to them are lost, since CPython uses reference counting.)
Nevertheless, taking care to free resources is a good habit to acquire, so I would probably write the FileParser class this way:
class FileParser:
    def __init__(self, file_or_filename, closing=False):
        if hasattr(file_or_filename, 'read'):
            self.f = file_or_filename
            self._need_to_close = closing
        else:
            self.f = open(file_or_filename, 'rb')
            self._need_to_close = True

    def close(self):
        if self._need_to_close:
            self.f.close()
            self._need_to_close = False
and then BatchImporter.parsers would become
def parsers(self):
    for file in self.files:
        yield FileParser(file)
or, if you love functional programming
def parsers(self):
    return itertools.imap(FileParser, self.files)
An aside: if you're new to Python, I recommend you take a look at the Python style guide (also known as PEP 8). Two-space indents look weird.
