I'm having trouble understanding the yield keyword.
I understand what it does when the program executes, but I don't really understand how much memory the different approaches use.
I'll try to explain my doubts using examples.
Let's say we have three functions:
import csv

HUGE_NUMBER = 9223372036854775807

def function1():
    for i in range(0, HUGE_NUMBER):
        yield i

def function2():
    x = range(0, HUGE_NUMBER)
    for i in x:
        yield i

def function3(file):
    with open(file, 'r') as f:
        dictionary = dict(csv.reader(f, delimiter=' '))
        for k, v in dictionary.iteritems():
            yield k, v
Does the huge range actually get stored in memory if I iterate over the generator returned by the first function?
What about the second function?
Would my program use less memory if I iterated over the generator returned by the third function (as opposed to just making that dictionary and iterating directly over it)?
Yes: the huge list produced by the Python 2 range() function does need to be stored, and it takes up memory for the full lifetime of the generator.
A generator function can be memory efficient provided the results it produces are calculated as needed, but the range() function produces all your results up front.
You could just calculate the next number:
def function1():
    i = 0
    while i < HUGE_NUMBER:
        yield i
        i += 1
and you'd get the same result, but you wouldn't be storing all numbers for the whole range in one go. This is essentially what looping over an xrange() object does; it calculates numbers as requested. (In Python 3, range() behaves like Python 2's xrange() and produces values on demand.)
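To see the difference yourself, sys.getsizeof() (discussed again further down) gives a rough idea; a quick sketch, assuming Python 2:
import sys

print sys.getsizeof(range(1000000))   # a full list; its size grows with the range
print sys.getsizeof(xrange(1000000))  # a small object of constant size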
The same applies to your function3: you read the whole file into a dictionary first, so it all sits in memory while you iterate. There is no need to read the whole file into memory just to yield each element afterwards. You could simply loop over the file and yield rows:
def function3(file):
    seen = set()
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter=' ')
        for k, v in reader:
            if k in seen:
                # already seen
                continue
            seen.add(k)
            yield k, v
This only stores the keys seen so far, to avoid duplicates (as the dictionary would), but not the values, so memory grows only with the number of distinct keys as you iterate over the generator. If duplicates are not an issue, you could omit tracking seen keys altogether:
def function3(file):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter=' ')
        for k, v in reader:
            yield k, v
or even
def function3(file):
    # no `with` here: returning from inside a with block would close the
    # file before the caller could iterate, so the caller owns the file
    f = open(file, 'r')
    return csv.reader(f, delimiter=' ')
as the reader is iterable, after all.
The generator object contains a reference to the function's scope and by extension all local objects within it. The way to reduce memory usage is to use iterators at every level possible, not just at the top level.
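For example, with the lazy version of function3 above you can pull just the first few pairs without ever reading the rest of the file; a small sketch (the file name is hypothetical):
import itertools

# only the first 10 rows of the file are ever read
first_ten = list(itertools.islice(function3('data.csv'), 10))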
If you want to check how much memory an object uses, sys.getsizeof() is a rough but helpful proxy:
sys.getsizeof(object)
getsizeof() calls the object's __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
Note that for containers, getsizeof() does not include the sizes of the items they refer to; a recursive recipe is needed to total those up.
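A minimal sketch of such a recursive recipe, assuming only the built-in container types need special handling:
import sys

def total_size(obj, seen=None):
    """Rough total of getsizeof() over an object and everything it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # don't count shared objects twice
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size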
I find it difficult to combine chained iterators with resource management smoothly in Python.
It will probably be clearer by examining a concrete example:
I have this little program that works on a bunch of similar, yet different CSV files. As they are shared with other co-workers, I need to open and close them frequently. Moreover, I need to transform and filter their content. So I have a lot of different functions of this kind:
import csv
from itertools import imap, ifilter  # Python 2

def doSomething(fpath):
    with open(fpath) as fh:
        r = csv.reader(fh, delimiter=';')
        s = imap(lambda row: fn(row), r)
        t = ifilter(lambda row: test(row), s)
        for row in t:
            doTheThing(row)
That's nice and readable, but, as I said, I have a lot of those and I end up copy-pasting a lot more than I'd wish. But of course I can't refactor the common code into a function returning an iterator:
def iteratorOver(fpath):
    with open(fpath) as fh:
        r = csv.reader(fh, delimiter=';')
        return r  # oops! fh is closed by the time I use it
A first step to refactor the code would be to create another 'with-enabled' class:
def openCsv(fpath):
    class CsvManager(object):
        def __init__(self, fpath):
            self.fh = open(fpath)

        def __enter__(self):
            return csv.reader(self.fh, delimiter=';')

        def __exit__(self, type, value, traceback):
            self.fh.close()

    return CsvManager(fpath)  # without this, no context manager is ever created
and then:
with openCsv('a_path') as r:
    s = imap(lambda row: fn(row), r)
    t = ifilter(lambda row: test(row), s)
    for row in t:
        doTheThing(row)
But I have only reduced the boilerplate of each function by one step.
So what is the Pythonic way to refactor such code? My C++ background is getting in the way, I think.
You can use generators; these produce an iterable you can then pass to other objects. For example, a generator yielding all the rows in a CSV file:
def iteratorOver(fpath):
    with open(fpath) as fh:
        r = csv.reader(fh, delimiter=';')
        for row in r:
            yield row
Because a generator function pauses whenever you are not iterating over it, it doesn't return until the loop completes, so the with statement won't close the file until the generator is exhausted.
You can now use that generator in a filter:
rows = iteratorOver('some path')
filtered = ifilter(test, rows)
etc.
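Put together, each of the original functions shrinks down to its unique logic; a sketch reusing the question's placeholder callables fn, test and doTheThing:
from itertools import imap, ifilter  # Python 2

def doSomething(fpath):
    rows = iteratorOver(fpath)    # the generator keeps the file open while rows are consumed
    transformed = imap(fn, rows)
    kept = ifilter(test, transformed)
    for row in kept:
        doTheThing(row)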
If I call the company_at_node method (shown below) twice, it will only print a row for the first call. I thought maybe that I needed to seek back to the beginning of the reader for the next call, so I added
self.companies.seek(0)
to the end of the company_at_node method, but DictReader has no attribute seek. Since the file is never closed (and since I didn't get an error message to that effect), I didn't think this was the ValueError: I/O operation on closed file issue (about which there are numerous questions on SO).
Is there a way to return to the beginning of a DictReader to iterate through a second time (i.e. a second function call)?
class CSVReader:
    def __init__(self):
        f = open('myfile.csv')
        self.companies = csv.DictReader(f)

    def company_at_node(self, node):
        for row in self.companies:
            if row['nodeid'] == node:
                print row
        self.companies.seek(0)
You need to call seek(0) on the underlying file object, not on the DictReader, so keep a reference to the file on the instance. This should work:
class CSVReader:
    def __init__(self):
        self.f = open('myfile.csv')
        self.companies = csv.DictReader(self.f)

    def company_at_node(self, node):
        for row in self.companies:
            if row['nodeid'] == node:
                print row
        self.f.seek(0)
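One caveat (based on my understanding of how DictReader caches its fieldnames): after f.seek(0), the same reader will hand back the header line as an ordinary data row on the next pass. Recreating the reader after seeking sidesteps that:
def company_at_node(self, node):
    self.f.seek(0)
    self.companies = csv.DictReader(self.f)  # a fresh reader re-consumes the header line
    for row in self.companies:
        if row['nodeid'] == node:
            print row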
In reader = csv.DictReader(f), reader is an iterator. An iterator emits one unit of data on each explicit or implicit invocation of __next__; that process is called consuming the iterator, and it can happen only once. This is how the iterator construct achieves its memory efficiency. So if you want random indexing, make a sequence out of it:
rows = list(reader)
I wrote a class inheriting from dict, with a member method to remove stale objects.
import time

class RoleCOList(dict):
    def __init__(self):
        dict.__init__(self)

    def recyle(self):
        '''remove roles not accessed for too long'''
        checkTime = time.time() - 60*30
        # collect the stale keys first, then remove them
        l = [k for k, v in self.items() if v.lastAccess < checkTime]
        for x in l:
            self.pop(x)
Isn't this too inefficient? I used two list loops, but I couldn't find another way.
At the SciPy conference last year, I attended a talk where the speaker said that any() and all() are fast ways to do a task in a loop. It makes sense; a for loop rebinds the loop variable on each iteration, whereas any() and all() simply consume the value.
Clearly, you use any() when you want to run a function that always returns a false value such as None. That way, the whole loop will run to the end.
checkTime = time.time() - 60*30
# use any() as a fast way to run a loop
# The .__delitem__() method always returns `None`, so this runs the whole loop
lst = [k for k in self.keys() if self[k].lastAccess < checkTime]
any(self.__delitem__(k) for k in lst)
What about this?
_ = [self.pop(k) for k, v in self.items() if v.lastAccess < checkTime]
Since you don't need the list you generated, you could use generators plus the consume recipe from the itertools documentation; in particular, collections.deque can run through a generator for you:
checkTime = time.time() - 60*30
# create a generator for all the values you will age off
age_off = (self.pop(k) for k in self.keys() if self[k].lastAccess < checkTime)
# let deque handle iteration (in one shot, with little memory footprint)
collections.deque(age_off, maxlen=0)
Since the dictionary is changed during the iteration of age_off, use self.keys() which returns a list. (Using self.iteritems() will raise a RuntimeError.)
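For reference, the consume recipe from the itertools documentation looks roughly like this:
import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead; if n is None, consume it entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)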
My (completely unreadable) solution:
from operator import delitem
map(lambda k: delitem(self, k), filter(lambda k: self[k].lastAccess < checkTime, iter(self)))
but at least it should be quite time and memory efficient ;-)
If performance is an issue, and if you will have large volumes of data, you might want to look into using a Python front-end for a system like memcached or redis; those can handle expiring old data for you.
http://memcached.org/
http://pypi.python.org/pypi/python-memcached/
http://redis.io/
https://github.com/andymccurdy/redis-py
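For instance, with the redis-py client you can attach a time-to-live to each key and let the server expire stale roles for you; a sketch, assuming a local Redis server and a hypothetical key name:
import redis

r = redis.Redis()                      # connects to localhost:6379 by default
r.set('role:1234', 'serialized role')  # hypothetical key and payload
r.expire('role:1234', 60 * 30)         # Redis removes the key after 30 minutes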
Say I have something like the following:
dest = "\n".join( [line for line in src.split("\n") if line[:1]!="#"] )
(i.e. strip any lines starting with # from the multi-line string src)
src is very large, so I'm assuming .split() will create a large intermediate list. I can change the list comprehension to a generator expression, but is there some kind of "xsplit" I can use to only work on one line at a time? Is my assumption correct? What's the most (memory) efficient way to handle this?
Clarification: This arose due to my code running out of memory. I know there are ways to entirely rewrite my code to work around that, but the question is about Python: Is there a version of split() (or an equivalent idiom) that behaves like a generator and hence doesn't make an additional working copy of src?
Here's a way to do a general type of split using itertools:
>>> import itertools as it
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (''.join(j) for i,j in it.groupby(src, "\n".__ne__) if i)
>>> '\n'.join(s for s in line_gen if s[0]!="#")
'hello\nworld'
groupby treats each char in src separately, so the performance probably isn't stellar, but it does avoid creating any intermediate huge data structures.
It's probably better to spend a few lines and make a generator:
>>> src = "hello\n#foo\n#bar\n#baz\nworld\n"
>>>
>>> def isplit(s, t):  # iterator to split string s at character t
...     i = j = 0
...     while True:
...         try:
...             j = s.index(t, i)
...         except ValueError:
...             if i < len(s):
...                 yield s[i:]
...             return  # a bare return ends the generator; raising StopIteration breaks under PEP 479
...         yield s[i:j]
...         i = j + 1
...
>>> '\n'.join(x for x in isplit(src, '\n') if x[0] != '#')
'hello\nworld'
The re module has finditer(), which could be used for this purpose too:
>>> import re
>>> src = "hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (m.group(1) for m in re.finditer("(.*?)(\n|$)", src))
>>> '\n'.join(s for s in line_gen if s and not s.startswith("#"))
'hello\nworld'
(The s and test drops the empty match that finditer yields at the very end of the string.)
Comparing the performance on the real data is an exercise left for the OP.
from StringIO import StringIO  # Python 2; use io.StringIO in Python 3

buffer = StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
Of course, this really makes the most sense if you use StringIO throughout. It works mostly the same as files. You can seek, read, write, iterate (as shown), etc.
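A quick illustration of that file-like behaviour (a sketch, assuming Python 2's StringIO module):
from StringIO import StringIO

buf = StringIO("keep\n#skip\nkeep too\n")
print buf.read(4)                                # 'keep', just like file.read(4)
buf.seek(0)                                      # rewind, just like a file
print [line for line in buf if line[:1] != "#"]  # iterate line by line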
In your existing code you can change the list to a generator expression:
dest = "\n".join(line for line in src.split("\n") if line[:1]!="#")
This very small change avoids the construction of one of the two temporary lists in your code, and requires no effort on your part.
A completely different approach that avoids the temporary construction of both lists is to use a regular expression:
import re
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)
This will not only avoid creating temporary lists, it will also avoid creating temporary strings for each line in the input. Here are some performance measurements of the proposed solutions:
init = r'''
import re, StringIO
regex = re.compile('^#.*\n?', re.M)
src = ''.join('foo bar baz\n' for _ in range(100000))
'''

method1 = r'"\n".join([line for line in src.split("\n") if line[:1] != "#"])'
method2 = r'"\n".join(line for line in src.split("\n") if line[:1] != "#")'
method3 = 'regex.sub("", src)'
method4 = '''
buffer = StringIO.StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
'''

import timeit
for method in [method1, method2, method3, method4]:
    print timeit.timeit(method, init, number=100)
Results:
9.38s # Split then join with temporary list
9.92s # Split then join with generator
8.60s # Regular expression
64.56s # StringIO
As you can see the regular expression is the fastest method.
From your comments I can see that you are not actually interested in avoiding creating temporary objects. What you really want is to reduce the memory requirements for your program. Temporary objects don't necessarily affect the memory consumption of your program as Python is good about clearing up memory quickly. The problem comes from having objects that persist in memory longer than they need to, and all these methods have this problem.
If you are still running out of memory then I'd suggest that you shouldn't be doing this operation entirely in memory. Instead store the input and output in files on the disk and read from them in a streaming fashion. This means that you read one line from the input, write a line to the output, read a line, write a line, etc. This will create lots of temporary strings but even so it will require almost no memory because you only need to handle the strings one at a time.
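A minimal sketch of that streaming approach, with hypothetical file names:
# one line is read, tested and written at a time; memory use stays flat
with open('input.txt') as src_file, open('output.txt', 'w') as dest_file:
    for line in src_file:
        if line[:1] != '#':
            dest_file.write(line)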
If I understand your question about "more generic calls to split()" correctly, you could use re.finditer, like so:
import re

output = ""
for i in re.finditer("^.*\n", input, re.M):
    i = i.group(0).strip()
    if i.startswith("#"):
        continue
    output += i + "\n"
Here you can replace the regular expression by something more sophisticated.
The problem is that strings are immutable in python, so it's going to be very difficult to do anything at all without intermediate storage.
I need to loop until I hit the end of a file-like object, but I'm not finding an "obvious way to do it", which makes me suspect I'm overlooking something, well, obvious. :-)
I have a stream (in this case, it's a StringIO object, but I'm curious about the general case as well) which stores an unknown number of records in "<length><data>" format, e.g.:
data = StringIO("\x07\x00\x00\x00foobar\x00\x04\x00\x00\x00baz\x00")
Now, the only clear way I can imagine to read this is using (what I think of as) an initialized loop, which seems a little un-Pythonic:
len_name = data.read(4)
while len_name != "":
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))
    len_name = data.read(4)
In a C-like language, I'd just stick the read(4) in the while's test clause, but of course that won't work for Python. Any thoughts on a better way to accomplish this?
You can combine iteration through iter() with a sentinel:
for block in iter(lambda: file_obj.read(4), ""):
    use(block)
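Applied to your length-prefixed records, the same sentinel trick also absorbs the priming read; a sketch along those lines:
import struct

def read_records(data):
    # read 4-byte length prefixes until read() returns the EOF sentinel ""
    for len_name in iter(lambda: data.read(4), ""):
        (length,) = struct.unpack("<I", len_name)
        yield data.read(length)

names = list(read_records(data))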
Have you seen how to iterate over lines in a text file?
for line in file_obj:
    use(line)
You can do the same thing with your own generator:
def read_blocks(file_obj, size):
    while True:
        data = file_obj.read(size)
        if not data:
            break
        yield data

for block in read_blocks(file_obj, 4):
    use(block)
See also:
file.read
I prefer the already mentioned iterator-based solution to turn this into a for-loop. Another solution written directly is Knuth's "loop-and-a-half"
while 1:
    len_name = data.read(4)
    if not len_name:
        break
    len_name = struct.unpack("<I", len_name)[0]  # unpack the length before reading the payload
    names.append(data.read(len_name))
You can see by comparison how that's easily hoisted into its own generator and used as a for-loop.
I see, as predicted, that the typical and most popular answers use very specialized generators to "read 4 bytes at a time". Sometimes generality isn't any harder (and is much more rewarding;-), so I've suggested instead the following very general solution:
import operator

def funlooper(afun, *a, **k):
    wearedone = k.pop('wearedone', operator.not_)
    while True:
        data = afun(*a, **k)
        if wearedone(data):
            break
        yield data
Now your desired loop header is just: for len_name in funlooper(data.read, 4):.
Edit: made much more general by the wearedone idiom since a comment accused my slightly less general previous version (hardcoding the exit test as if not data:) of having "a hidden dependency", of all things!-)
The usual swiss army knife of looping, itertools, is fine too, of course, as usual:
import itertools as it
for len_name in it.takewhile(bool, it.imap(data.read, it.repeat(4))): ...
or, quite equivalently:
import itertools as it

def loop(pred, fun, *args):
    return it.takewhile(pred, it.starmap(fun, it.repeat(args)))
for len_name in loop(bool, data.read, 4): ...
The EOF marker in Python is an empty string, so what you have is pretty close to the best you are going to get without wrapping this up in an iterator. It could be written in a slightly more Pythonic way by changing the while to:
len_name = data.read(4)
while len_name:
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))
    len_name = data.read(4)
I'd go with Tendayi's suggestion of a function and iterator for readability:
def read4():
    len_name = data.read(4)
    if len_name:
        len_name = struct.unpack("<I", len_name)[0]
        return data.read(len_name)
    else:
        raise StopIteration

for d in iter(read4, ''):
    names.append(d)