My main use of generators is processing rows of CSV files stored on a remote server. It gives me a consistent interface for linearly processing the data stored in them.
Now, I am using paramiko to access the SFTP server that stores the files, and paramiko has an outstanding issue of not properly closing connections if you do not close the file itself.
I've got a simple interface for accessing a single file on the SFTP server (this is obviously pseudocode; I am omitting the connection and error handling code and so on):
def sftp_read_file(filename):
    with paramiko.open(filename) as file_obj:
        for item in csv.reader(file_obj):
            yield item

def csv_append_column(iter_obj, col_name, col_val):
    # header row (csv.reader yields lists, so extend with a list)
    yield next(iter_obj) + [col_name]
    for item in iter_obj:
        yield item + [col_val]
Let's say I would like to test a set of transformations done to the file by running the script for a limited number of rows:
def main():
    for i, item in enumerate(csv_append_column(sftp_read_file('sftp://...'), 'A', 'B')):
        print(item)
        if i > 0 and i % 100 == 0:
            break
The script will exit, but the interpreter will never terminate without SIGINT. What are my possible solutions?
This isn't the most elegant solution, but maybe we could build off @tadhg-mcdonald-jensen's suggestion by wrapping the generator in an object:
class Stoppable(object):
    def __init__(self, fn):
        self.generator = fn

    def __enter__(self):
        return self.generator

    def __exit__(self, type_, value, traceback):
        self.generator.close()
And then use it like this:
def main():
    with Stoppable(sftp_read_file('sftp://...')) as reader:
        for i, item in enumerate(csv_append_column(reader, 'A', 'B')):
            print(item)
            if i > 0 and i % 100 == 0:
                break
Alternatively, we can just wrap the generator itself if we aren't relying on it for streaming:
def stopit(fn):
    # exhaust the generator eagerly so its with block exits and the connection is closed
    rg = [x for x in fn]
    for x in rg:
        yield x
Now we can call it like this:
def main():
    for i, item in enumerate(csv_append_column(stopit(sftp_read_file('...')), 'A', 'B')):
        print(item)
        if i > 0 and i % 100 == 0:
            break
This will make sure the with block exits and paramiko closes the SFTP connection, but it comes at the expense of reading all of the lines into memory at once.
I need to sort a large text file whose records are separated by newline characters.
I have to assume that the input data is too big to fit into main memory, meaning that I can only read and store one line of the file in memory at a time. Therefore, I can't build a list or array to use in a classic sorting algorithm (mergesort, quicksort, etc.), and because of that I'm stuck.
How could one approach this kind of problem?
In practice, under normal circumstances, just use the Unix sort command.
LC_ALL=C sort file_in.txt > file_out.txt
For a very large file, I'd go distributed and use the sort built into MapReduce. See How does the MapReduce sort algorithm work? for how that one works. One tip: if you go distributed, stay distributed. That is, the input file, output file, and operation should all be on distributed filesystems so that no single machine becomes a bottleneck.
Exactly once have I faced a situation where neither of those would work: I needed to sort and organize a dataset coming from a database, but the data was too big to sort in the database, and the machine I was on did not have space for the raw dataset.
I solved that one with a mergesort where all chunks above a certain size were kept in compressed files. The key logic was something like this:
chunks = []
for row in big query:
    last_chunk = new chunk from row
    chunk = None
    while 0 < len(chunks):
        chunk = chunks.pop()
        if chunk.size < 1.2 * last_chunk.size:
            last_chunk = merge(chunk, last_chunk)
            chunk = None  # absorbed into last_chunk, so don't push it back
        else:
            break
    if chunk is not None:
        chunks.append(chunk)
    chunks.append(last_chunk)

while 1 < len(chunks):
    chunks.append(merge(chunks.pop(), chunks.pop()))
The merge logic then had the reasoning about whether a chunk should wind up in memory or in a compressed file, how to get the rows out of each chunk, and how to write them.
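A minimal sketch of what such a merge might look like, assuming the chunks being merged are gzip-compressed files of already-sorted text lines (the names here are hypothetical, and the real code also handled grouping and summing):

import gzip
import heapq
import tempfile

def merge_sorted_chunks(path_a, path_b):
    # Merge two gzip-compressed files of sorted lines into a new compressed file.
    # heapq.merge consumes both inputs lazily, so only one line per input needs
    # to be held in memory at a time.
    out = tempfile.NamedTemporaryFile(suffix=".gz", delete=False)
    with gzip.open(path_a, "rt") as a, gzip.open(path_b, "rt") as b, \
            gzip.open(out.name, "wt") as dst:
        for line in heapq.merge(a, b):
            dst.write(line)
    return out.name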
(My problem was not so simple as this, because I was grouping, summing, merging, etc. Basically duplicating a report I couldn't run in the database because it didn't have enough working memory to run it.)
Those three cover every situation I've personally encountered with large files.
Here is a not great but working implementation of an external file-based mergesort.
First you need demo.txt that we want to sort:
hello
world
greetings
earthlings
And now Python code to print it in sorted order:
from tempfile import TemporaryFile

class FileBuffer():
    def __init__(self):
        self.current_line = None
        self.fh = TemporaryFile()
        self.state = 'writing'
        self.size = 0

    def add(self, line):
        if self.state != 'writing':
            raise Exception(f"Cannot write to FileBuffer in state {self.state}")
        self.size += len(line)
        self.fh.write(bytes(line, encoding='utf-8'))

    def finish_writing(self):
        self.fh.seek(0, 0)
        self.state = 'reading'
        self.fetch()
        return self.current_line

    def fetch(self):
        if self.state != 'reading':
            raise Exception(f"Cannot read from FileBuffer in state {self.state}")
        self.current_line = bytes.decode(self.fh.readline())
        if self.current_line == '':
            self.current_line = None
            self.state = 'done'
        return self.current_line

    def __iter__(self):
        return self

    def __next__(self):
        line = self.current_line
        if line is None:
            raise StopIteration
        else:
            self.fetch()
            return line
class BufferedSort():
    def __init__(self):
        self.buffers = []

    def merge(self):
        buffer1 = self.buffers.pop()
        buffer2 = self.buffers.pop()
        new_buffer = FileBuffer()
        while buffer1.current_line is not None and buffer2.current_line is not None:
            if buffer1.current_line < buffer2.current_line:
                new_buffer.add(buffer1.current_line)
                buffer1.fetch()
            else:
                new_buffer.add(buffer2.current_line)
                buffer2.fetch()
        while buffer1.current_line is not None:
            new_buffer.add(buffer1.current_line)
            buffer1.fetch()
        while buffer2.current_line is not None:
            new_buffer.add(buffer2.current_line)
            buffer2.fetch()
        new_buffer.finish_writing()
        self.buffers.append(new_buffer)

    def add(self, line):
        buffer = FileBuffer()
        buffer.add(line)
        buffer.finish_writing()
        self.buffers.append(buffer)
        while 1 < len(self.buffers) and self.buffers[-2].size < 1.2 * self.buffers[-1].size:
            self.merge()

    def finish_writing(self):
        # merge everything down to a single sorted buffer
        while 1 < len(self.buffers):
            self.merge()

    def sorted_buffer(self):
        self.finish_writing()
        if len(self.buffers):
            return self.buffers[0]
        else:
            buffer = FileBuffer()
            buffer.state = 'done'
            return buffer

    def __iter__(self):
        return self.sorted_buffer()
sorter = BufferedSort()
with open("demo.txt") as fh:
    for line in fh:
        sorter.add(line)

for line in sorter:
    print(line, end="")
I would like to write a class with the following interface.
class Automaton:
    """ A simple automaton class """

    def iterate(self, something):
        """ yield something and expects some result in return """
        print("Yielding", something)
        result = yield something
        print("Got \"" + result + "\" in return")
        return result

    def start(self, somefunction):
        """ start the iteration process """
        yield from somefunction(self.iterate)
        raise StopIteration("I'm done!")

def first(iterate):
    while iterate("what do I do?") != "over":
        continue

def second(iterate):
    value = yield from iterate("what do I do?")
    while value != "over":
        value = yield from iterate("what do I do?")

# A simple driving process
automaton = Automaton()
#generator = automaton.start(first)  # This one hangs
generator = automaton.start(second)  # This one runs smoothly
next_yield = generator.__next__()
for step in range(4):
    next_yield = generator.send("Continue...({})".format(step))
try:
    end = generator.send("over")
except StopIteration as excp:
    print(excp)
The idea is that Automaton will regularly yield values to the caller which will in turn send results/commands back to the Automaton.
The catch is that the decision process somefunction will be some user-defined function I have no control over. This means I can't really expect it to call the iterate method with a yield from in front. Worse, the user might want to plug some third-party function they have no control over into this Automaton class, meaning they might not be able to rewrite their somefunction to include yield from in front of the iterate calls.
To be clear: I completely understand why using the first function hangs the automaton. I am just wondering if there is a way to alter the definition of iterate or start that would make the first function work.
What I mean by "forkable iterator": it is a regular iterator with a method fork() which creates a new iterator that iterates from the current point of iteration of the original iterator. Even if the original iterator is iterated further, the fork stays at the point where it was forked until it is itself iterated over.
My practical use case:
I have a socket connection, and some "packets" that are sent through it. The connection can be shared between "receivers", and each "packet" can be addressed to some "receiver". "Packets" can arrive out of order, so each "receiver" can potentially receive a packet meant for a different "receiver". More than that: if one "receiver" receives a "packet" for a different "receiver", this other "receiver" must still be able to read that packet.
So for that I want to implement such a forkable iterator, which will represent the connection; each receiver will make its own fork, read it, and search for "packets" addressed to it.
Does somebody know of any implementations of what I'm talking about?
You are looking for the itertools.tee() function:
Return n independent iterators from a single iterable.
Do take into account that the implementation will buffer data to service all child iterators:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored).
Also, you should only use the returned child iterators; iterating over the source iterator will not propagate the data to the tee() iterables.
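For example, a quick illustration of tee() producing two independent child iterators:

import itertools

source = iter(range(5))
a, b = itertools.tee(source, 2)

print(next(a))   # 0
print(next(a))   # 1
print(next(b))   # 0 -- b is still at the start; tee buffers what a already consumed
print(list(a))   # [2, 3, 4]
print(list(b))   # [1, 2, 3, 4]

In the connection scenario above, each receiver could hold one of the tee children and skim it for its own packets, at the cost of tee buffering everything the slowest receiver has not yet read.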
That's my current implementation of a forkable iterator:
#!/usr/bin/env python
# coding=utf-8
from collections import Iterator, deque
import threading

class ForkableIterator(Iterator):
    def __init__(self, iterator, buffer=None, *args, **kwargs):
        self.iterator = iter(iterator)
        if buffer is None:
            self.buffer = deque()
        else:
            self.buffer = buffer
        args = iter(args)
        self.refs = kwargs.get('refs', next(args, {}))
        self.refs.setdefault('base', 0)
        self.pointer = kwargs.get('pointer', next(args, 0))
        self.lock = kwargs.get('lock', next(args, threading.Lock()))

    @property
    def pointer(self):
        return self.refs[self] + self.refs['base']

    @pointer.setter
    def pointer(self, value):
        self.refs[self] = value

    def __del__(self):
        del self.refs[self]

    def __iter__(self):
        return self

    def next(self):
        with self.lock:
            if len(self.buffer) - self.pointer == 0:
                elem = next(self.iterator)
                self.buffer.append(elem)
            else:
                if self.pointer == min(self.refs.itervalues()):
                    elem = self.buffer.popleft()
                    self.refs['base'] -= 1
                else:
                    elem = self.buffer[self.pointer]
            self.pointer += 1
            return elem

    def fork(self):
        return self.__class__(self.iterator, self.buffer,
                              refs=self.refs, pointer=self.pointer,
                              lock=self.lock)
I know Python has some lazy evaluation features, and as such I was wondering whether it is possible to use circular programming in Python.
If it isn't, why not?
I think you mean co-routines, not co-recursion. Yes, it's perfectly possible in Python, since PEP 342: Coroutines via Enhanced Generators has been implemented.
The canonical example is the consumer decorator:
def consumer(func):
    def wrapper(*args, **kw):
        gen = func(*args, **kw)
        next(gen)
        return gen
    wrapper.__name__ = func.__name__
    wrapper.__dict__ = func.__dict__
    wrapper.__doc__ = func.__doc__
    return wrapper
Using such a consumer then lets you chain filters and push information through them, acting as a pipeline:
import os
from itertools import product

@consumer
def thumbnail_pager(pagesize, thumbsize, destination):
    while True:
        page = new_image(pagesize)
        rows, columns = pagesize / thumbsize
        pending = False
        try:
            for row, column in product(range(rows), range(columns)):
                thumb = create_thumbnail((yield), thumbsize)
                page.write(
                    thumb, column * thumbsize.x, row * thumbsize.y
                )
                pending = True
        except GeneratorExit:
            # close() was called, so flush any pending output
            if pending:
                destination.send(page)
            # then close the downstream consumer, and exit
            destination.close()
            return
        else:
            # we finished a page full of thumbnails, so send it
            # downstream and keep on looping
            destination.send(page)

@consumer
def jpeg_writer(dirname):
    fileno = 1
    while True:
        filename = os.path.join(dirname, "page%04d.jpg" % fileno)
        write_jpeg((yield), filename)
        fileno += 1

# Put them together to make a function that makes thumbnail
# pages from a list of images and other parameters.
#
def write_thumbnails(pagesize, thumbsize, images, output_dir):
    pipeline = thumbnail_pager(
        pagesize, thumbsize, jpeg_writer(output_dir)
    )
    for image in images:
        pipeline.send(image)
    pipeline.close()
The central principles are Python generators and yield expressions; the latter lets a generator receive information from its caller.
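A minimal illustration of a yield expression receiving values from send() (a running-total coroutine, just for demonstration):

def running_total():
    total = 0
    while True:
        value = yield total   # hand the current total back, wait for the next send()
        total += value

acc = running_total()
next(acc)             # prime the coroutine, advancing it to the first yield
print(acc.send(10))   # 10
print(acc.send(5))    # 15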
Edit: Ah, co-recursion is indeed a different concept. Note that the Wikipedia article uses Python for its examples and, moreover, uses Python generators.
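For a concrete flavour of that generator-based corecursion, here is a small sketch of an infinite Fibonacci stream, where each new value is defined from the values the stream itself produced earlier and the whole thing is consumed lazily:

from itertools import islice

def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b   # the next value depends on the stream's own earlier output

print(list(islice(fibonacci(), 10)))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]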
Did you try it?
def a(x):
    if x == 1: return
    print "a", x
    b(x - 1)

def b(x):
    if x == 1: return
    print "b", x
    a(x - 1)

a(10)
As a side note, Python does not have tail-call optimization, and this will fail once x approaches the default recursion limit of 1000 (although this limit is configurable).
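For reference, the limit can be inspected and raised, within the bounds of the OS stack size, via the sys module:

import sys

print(sys.getrecursionlimit())   # typically 1000 by default
sys.setrecursionlimit(5000)      # allow the mutual recursion above to go deeper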
I'm writing a Python generator which looks like "cat". My specific use case is a "grep-like" operation. I want to be able to break out of the generator if a condition is met:
summary = {}
for fn in cat("filelist.dat"):
    for line in cat(fn):
        if line.startswith("FOO"):
            summary[fn] = line
            break
So when break happens, I need the cat() generator to finish and close the file handle to fn.
I have to read 100k files with 30 GB of total data, and the FOO keyword happens in the header region, so it is important in this case that the cat() function stops reading the file ASAP.
There are other ways I can solve this problem, but I'm still interested to know how to get an early exit from a generator which has open file handles. Perhaps Python cleans them up right away and closes them when the generator is garbage collected?
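For context, cat() itself is not shown here; presumably it is a generator along these lines (a hypothetical sketch, not the actual implementation):

def cat(filename):
    # hypothetical sketch of the cat() generator referred to above
    with open(filename) as fh:
        for line in fh:
            yield line.rstrip("\n")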
Thanks,
Ian
Generators have a close() method that raises GeneratorExit at the paused yield expression. If you specifically catch this exception, you can run some tear-down code:
import contextlib

with contextlib.closing(cat(fn)):
    ...
and then in cat:
try:
    ...
except GeneratorExit:
    # close the file
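Putting that together, a sketch of what a cat generator with explicit tear-down might look like (the GeneratorExit branch is only needed if you want extra cleanup on early exit; a plain try/finally is already enough to close the file):

def cat(filename):
    f = open(filename)
    try:
        for line in f:
            yield line
    except GeneratorExit:
        # raised at the paused yield when close() is called (or the generator
        # is garbage collected); re-raise after any extra cleanup
        raise
    finally:
        f.close()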
If you'd like a simpler way to do this (without using the arcane close method on generators), just make cat take a file-like object instead of a string to open, and handle the file IO yourself:
for filename in filenames:
    with open(filename) as theFile:
        for line in cat(theFile):
            ...
However, you basically don't need to worry about any of this, because garbage collection will handle it all. Still,
explicit is better than implicit
By implementing the context manager protocol and the iterator protocol in the same object, you can write pretty sweet code like this:
with cat("/etc/passwd") as lines:
for line in lines:
if "mail" in line:
print line.strip()
break
This is a sample implementation, tested with Python 2.5 on a Linux box. It reads the lines of /etc/passwd until it finds the one for user mail, and then stops:
from __future__ import with_statement

class cat(object):
    def __init__(self, fname):
        self.fname = fname

    def __enter__(self):
        print "[Opening file %s]" % (self.fname,)
        self.file_obj = open(self.fname, "rt")
        return self

    def __exit__(self, *exc_info):
        print "[Closing file %s]" % (self.fname,)
        self.file_obj.close()

    def __iter__(self):
        return self

    def next(self):
        line = self.file_obj.next().strip()
        print "[Read: %s]" % (line,)
        return line

def main():
    with cat("/etc/passwd") as lines:
        for line in lines:
            if "mail" in line:
                print line.strip()
                break

if __name__ == "__main__":
    import sys
    sys.exit(main())
Or even simpler:
with open("/etc/passwd", "rt") as f:
for line in f:
if "mail" in line:
break
File objects implement the iterator protocol (see http://docs.python.org/library/stdtypes.html#file-objects)
Please also consider this example:
def itertest():
    try:
        for i in xrange(1000):
            print i
            yield i
    finally:
        print 'finally'

x = itertest()
for i in x:
    if i > 2:
        break

print 'del x'
del x
print 'exit'
0
1
2
3
del x
finally
exit
It shows that the finally block is run after the iterator is cleaned up. I think __del__ calls self.close(); see also https://docs.python.org/2.7/reference/expressions.html#generator.close
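You can also trigger that cleanup deterministically by calling close() yourself rather than relying on __del__; a small illustration:

def gen():
    try:
        for i in range(3):
            yield i
    finally:
        print('finally')   # runs when the generator is closed or exhausted

g = gen()
print(next(g))   # 0
g.close()        # raises GeneratorExit at the paused yield, so 'finally' runs here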
There seems to be another possibility using try..finally (tested on Python 2.7.6):
def gen():
    i = 0
    try:
        while True:
            print 'yield %i' % i
            yield i
            i += 1
        print 'will never get here'
    finally:
        print 'done'

for i in gen():
    if i > 1:
        print 'break'
        break
    print i
Gives me the following printout:
yield 0
0
yield 1
1
yield 2
break
done