I find it difficult to combine chained iterators with resource management smoothly in Python.
A concrete example will probably make this clearer:
I have a little program that works on a bunch of similar, yet different CSV files. As they are shared with co-workers, I need to open and close them frequently. Moreover, I need to transform and filter their content. So I have a lot of different functions of this kind:
import csv
from itertools import ifilter, imap

def doSomething(fpath):
    with open(fpath) as fh:
        r = csv.reader(fh, delimiter=';')
        s = imap(lambda row: fn(row), r)
        t = ifilter(lambda row: test(row), s)
        for row in t:
            doTheThing(row)
That's nice and readable, but, as I said, I have a lot of those and I end up copy-pasting a lot more than I'd wish. But of course I can't refactor the common code into a function returning an iterator:
def iteratorOver(fpath):
    with open(fpath) as fh:
        r = csv.reader(fh, delimiter=';')
        return r  # oops! fh is closed by the time I use it
A first step to refactor the code would be to create another 'with-enabled' class:
def openCsv(fpath):
    class CsvManager(object):
        def __init__(self, fpath):
            self.fh = open(fpath)
        def __enter__(self):
            return csv.reader(self.fh, delimiter=';')
        def __exit__(self, type, value, traceback):
            self.fh.close()
    return CsvManager(fpath)  # the function must return an instance
and then:
with openCsv('a_path') as r:
    s = imap(lambda row: fn(row), r)
    t = ifilter(lambda row: test(row), s)
    for row in t:
        doTheThing(row)
But this only reduces the boilerplate of each function by one step.
So what is the Pythonic way to refactor such code? My C++ background is getting in the way, I think.
You can use generators; these produce an iterable you can then pass to other objects. For example, a generator yielding all the rows in a CSV file:
def iteratorOver(fpath):
    with open(fpath) as fh:
        r = csv.reader(fh, delimiter=';')
        for row in r:
            yield row
Because a generator function pauses whenever you are not iterating over it, it doesn't exit until the loop is complete, so the with statement won't close the file prematurely.
You can now use that generator in a filter:
rows = iteratorOver('some path')
filtered = ifilter(test, rows)
etc.
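Putting it all together, the per-file boilerplate collapses to a couple of lines. A minimal sketch, assuming fn, test and doTheThing are the hypothetical helpers from the question:

import csv
from itertools import ifilter, imap

def iteratorOver(fpath):
    # generator: the file stays open exactly as long as rows are being consumed
    with open(fpath) as fh:
        for row in csv.reader(fh, delimiter=';'):
            yield row

def doSomething(fpath):
    rows = iteratorOver(fpath)
    transformed = imap(fn, rows)            # fn: hypothetical row transform
    for row in ifilter(test, transformed):  # test: hypothetical predicate
        doTheThing(row)                     # doTheThing: hypothetical consumer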
Related
I am working on an assignment where I create "instances" of cities using rows in a .csv, then use these instances in methods to calculate distance and population change. Creating the instances works fine (using steps 1-4 below), until I try to call printDistance:
##Step 1. Open and read CityPop.csv
with open('CityPop.csv', 'r', newline='') as f:
    try:
        reader = csv.DictReader(f)
        ##Step 2. Create "City" class
        class City:
            ##Step 3. Use __init__ method to assign attribute values
            def __init__(self, row, header):
                self.__dict__ = dict(zip(header, row))
                ##Step 4. Create "Cities" list
                data = list(csv.reader(open('CityPop.csv')))
                instances = [City(i, data[0]) for i in data[1:]]
                ##Step 5. Create printDistance method within "Cities" class
                def printDistance(self, othercity, instances):
                    dist=math.acos((math.sin(math.radians(self.lat)))*(math.sin(math.radians(othercity.lat)))+(math.cos(math.radians(self.lat)))*(math.cos(math.radians(othercity.lat)))*(math.cos(math.radians(self.lon-othercity.lon)))) * 6300 (self.lat, self.lon, othercity.lat, othercity.lon)
When I enter instances[0].printDistance(instances[1]) in the shell, I get the error:
`NameError: name 'instances' is not defined`
Is this an indentation problem? Should I be calling the function from within the code, not the shell?
Nested functions should not take self as a parameter, because they are not member functions; the class cannot pass the instance to them. You are in fact passing the same self from the parent function to the child function.
Also, you should not nest other definitions inside the constructor; it is only for initialization. Create a separate method instead.
And create the instance variable inside the constructor; that is what __init__ is for:
self.instances = [self.getInstance(i, data[0]) for i in data[1:]]
Also, create a separate function for instantiation:
@classmethod
def getInstance(cls, d1, d2):
    return cls(d1, d2)
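Assembled, that suggestion might look roughly like this sketch (not the assignment's required layout; the getInstance name is kept from above):

import csv

class City:
    def __init__(self, row, header):
        # map header names to the row's values as attributes
        self.__dict__ = dict(zip(header, row))

    @classmethod
    def getInstance(cls, row, header):
        return cls(row, header)

data = list(csv.reader(open('CityPop.csv')))
instances = [City.getInstance(i, data[0]) for i in data[1:]]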
This is not so much an indentation problem as a general code structure problem. You're nesting a lot:
All the actual work on an incredibly long line (with errors)
Inside a function printDistance (correctly)
Inside the constructor __init__
Inside a class definition City (correctly)
Inside a try block
Inside a with block
I think this is what you are trying to do:
create a class City, which can print the distance of itself to other cities
generate a list of these City objects from a .csv that somehow has both distances and population (you should probably provide an example of data)
do so in a fault-tolerant and clean way (hence the try and the with)
The reason your instances isn't working is that, contrary to what you think, it's probably not being created correctly, or at least not in the correct context. And it certainly won't be available to you in the shell, due to all of the nesting.
There are a number of blatant bugs in your code:
What's the (self.lat, self.lon, othercity.lat, othercity.lon) at the end of the last line?
Why are you opening the file for reading twice? You're not even using the first reader
You are bluntly assigning column headers from a .csv as object attributes, but are misspelling their use (lat instead of latitude and lon instead of longitude)
It looks a bit like a lot of code found in various places got pasted together into one clump - this is what it looks like when cleaned up:
import csv
import math

class City:
    def print_distance(self, other_city):
        print(f'{self.city} to {other_city.city}')
        # what a mess...
        print(math.acos(
            (math.sin(math.radians(float(self.latitude)))) * (math.sin(math.radians(float(other_city.latitude)))) + (
                math.cos(math.radians(float(self.latitude)))) * (math.cos(math.radians(float(other_city.latitude)))) * (
                math.cos(math.radians(float(self.longitude) - float(other_city.longitude))))) * 6300)

    def __init__(self, values, attribute_names):
        # this is *nasty* - much better to add the attributes explicitly, but left as original
        # also, note that you're reading strings and floats here, but they are all stored as str
        self.__dict__ = dict(zip(attribute_names, values))

with open('CityPop.csv', 'r', newline='') as f:
    try:
        reader = csv.reader(f)
        header = next(reader)
        cities = [City(row, header) for row in reader]
        for city_1 in cities:
            for city_2 in cities:
                city_1.print_distance(city_2)
    except Exception as e:
        print(f"Apparently we're doing something with this error: {e}")
Note how print_distance is now a method of City, which is called on each instance of City in cities (which is what I renamed instances to).
Now, if you are really trying to do this properly, this makes more sense:
import csv
import math

class City:
    def print_distance(self, other_city):
        print(f'{self.name} to {other_city.name}')
        # not a lot better, but some at least
        print(
            math.acos(
                math.sin(math.radians(self.lat)) *
                math.sin(math.radians(other_city.lat))
                +
                math.cos(math.radians(self.lat)) *
                math.cos(math.radians(other_city.lat)) *
                math.cos(math.radians(self.lon - other_city.lon))
            ) * 6300
        )

    def __init__(self, lat, lon, name):
        self.lat = float(lat)
        self.lon = float(lon)
        self.name = str(name)

try:
    with open('CityPop.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        cities = [City(lat=row[1], lon=row[2], name=row[4]) for row in reader]
        for city_1 in cities:
            for city_2 in cities:
                city_1.print_distance(city_2)
except FileNotFoundError:
    print('Could not find the input file.')
Note the cleaned-up computation, the catching of an error that could be expected to occur (with the with inside the try block), and a proper constructor that assigns what it needs with the correct type, while the reader decides which fields go where.
Finally, as a bonus: nobody should be writing distance calculations like this. Plenty of libraries exist that do a much better job of this, like GeoPy. All you need to do is pip install geopy to get it, and then you can use this:
import csv
import geopy.distance

class City:
    def calc_distance(self, other_city):
        return geopy.distance.geodesic(
            (self.lat, self.lon),
            (other_city.lat, other_city.lon)
        ).km

    def __init__(self, lat, lon, name):
        self.lat = float(lat)
        self.lon = float(lon)
        self.name = str(name)

try:
    with open('CityPop.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        cities = [City(lat=row[1], lon=row[2], name=row[4]) for row in reader]
        for city_1 in cities:
            for city_2 in cities:
                print(city_1.calc_distance(city_2))
except FileNotFoundError:
    print('Could not find the input file.')
Note that I moved the print out of the method as well, since it makes more sense to calculate in the object and print outside it. The nice thing about all this is that the calculation now uses a proper geodesic (WGS-84) to do the calculation and the odds of math errors are drastically reduced. If you must use a simple sphere, the library has functions for that as well.
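For example, the spherical calculation is a single call to great_circle, GeoPy's simple-sphere counterpart to geodesic (the coordinates below are arbitrary example values):

import geopy.distance

# great-circle distance on a simple sphere instead of the WGS-84 geodesic
print(geopy.distance.great_circle((52.37, 4.90), (48.86, 2.35)).km)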
I am filtering some data in a pandas.DataFrame and want to track the rows I lose. So basically, I want to:
df = pandas.read_csv(...)
n1 = df.shape[0]
df = ... # some logic that might reduce the number of rows
print(f'Lost {n1 - df.shape[0]} rows')
Now there are multiple of these filter steps, and the code before/after it is always the same. So I am looking for a way to abstract that away.
Of course the first thing that comes to mind is decorators - however, I don't like the idea of creating a bunch of functions with just one LOC.
What I came up with are context managers:
from contextlib import contextmanager

@contextmanager
def rows_lost(df):
    try:
        n1 = df.shape[0]
        yield df
    finally:
        print(f'Lost {n1 - df.shape[0]} rows')
And then:
with rows_lost(df) as df:
    df = ...
I am wondering whether there is a better solution to this?
Edit:
I just realized that the context manager approach does not work if a filter step returns a new object (which is the default for pandas DataFrames). It only works when the objects are modified in place.
You could write a "wrapper-function" that wraps the filter you specify:
def filter1(arg):
    return arg + 1

def filter2(arg):
    return arg * 2

def wrap_filter(arg, filter_func):
    print('calculating with argument', arg)
    result = filter_func(arg)
    print('result', result)
    return result

wrap_filter(5, filter1)
wrap_filter(5, filter2)
The only thing this improves on compared to a decorator is that you can choose to call the filter without the wrapper...
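If you do want the row counting to read like the original code, a decorator-style wrapper specialized to DataFrames works even when the filter returns a new object. A minimal sketch (log_rows_lost is a hypothetical name, and 'value' is an assumed column, purely for illustration):

import functools

def log_rows_lost(filter_func):
    @functools.wraps(filter_func)
    def wrapper(df, *args, **kwargs):
        n_before = df.shape[0]
        result = filter_func(df, *args, **kwargs)
        print(f'Lost {n_before - result.shape[0]} rows')
        return result
    return wrapper

@log_rows_lost
def drop_negatives(df):
    return df[df['value'] >= 0]  # 'value' is an assumed column name

Because the wrapper compares the input's row count to the returned object's, reassignment is no longer a problem, and it composes with method chaining via df.pipe(drop_negatives).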
If I call the company_at_node method (shown below) twice, it only prints a row for the first call. I thought maybe I needed to seek back to the beginning of the reader for the next call, so I added
self.companies.seek(0)
to the end of the company_at_node method, but DictReader has no attribute seek. Since the file is never closed (and since I didn't get an error message to that effect), I didn't think this was the ValueError: I/O operation on closed file that numerous questions on SO deal with.
Is there a way to return to the beginning of a DictReader to iterate through a second time (i.e. a second function call)?
class CSVReader:
    def __init__(self):
        f = open('myfile.csv')
        self.companies = csv.DictReader(f)

    def company_at_node(self, node):
        for row in self.companies:
            if row['nodeid'] == node:
                print row
        self.companies.seek(0)
You need to call f.seek(0) on the underlying file object instead of the DictReader. To do that, keep a reference to the file on the instance. This should work:

class CSVReader:
    def __init__(self):
        self.f = open('myfile.csv')
        self.companies = csv.DictReader(self.f)

    def company_at_node(self, node):
        for row in self.companies:
            if row['nodeid'] == node:
                print row
        # rewind the underlying file; because DictReader caches its fieldnames,
        # the header line would come back as a data row unless we skip it
        self.f.seek(0)
        next(self.f)
In reader = csv.DictReader(f), the instance reader is an iterator. An iterator emits a unit of data on each explicit or implicit invocation of __next__ on it. That process is called consuming the iterator, and it can happen only once; this is how the iterator construct achieves its memory efficiency. So if you want to go over the data more than once (or index into it randomly), make a sequence out of it:
rows = list(reader)
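Applied to the class above, that might look like this sketch: everything is read once in the constructor, at the cost of holding all rows in memory, and company_at_node can then be called any number of times:

import csv

class CSVReader:
    def __init__(self):
        with open('myfile.csv') as f:
            self.companies = list(csv.DictReader(f))  # a reusable list of dicts

    def company_at_node(self, node):
        for row in self.companies:
            if row['nodeid'] == node:
                print row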
I'm having trouble understanding the yield keyword.
I understand the effects in terms of what happens when the program gets executed, but I don't really understand how much memory it uses.
I'll try to explain my doubts using examples.
Let's say we have three functions:
import csv

HUGE_NUMBER = 9223372036854775807

def function1():
    for i in range(0, HUGE_NUMBER):
        yield i

def function2():
    x = range(0, HUGE_NUMBER)
    for i in x:
        yield i

def function3(file):
    with open(file, 'r') as f:
        dictionary = dict(csv.reader(f, delimiter=' '))
        for k, v in dictionary.iteritems():
            yield k, v
Does the huge range actually get stored in memory if I iterate over the generator returned by the first function?
What about the second function?
Would my program use less memory if I iterated over the generator returned by the third function (as opposed to just making that dictionary and iterating directly over it)?
The huge list produced by the Python 2 range() function will need to be stored, yes, and will take up memory, for the full lifetime of the generator function.
A generator function can be memory efficient provided the results it produces are calculated as needed, but the range() function produces all your results up front.
You could just calculate the next number:
def function1():
    i = 0
    while i < HUGE_NUMBER:
        yield i
        i += 1
and you'd get the same result, but you wouldn't be storing all numbers for the whole range in one go. This is essentially what looping over an xrange() object does; it calculates numbers as requested. (In Python 3, range() behaves like Python 2's xrange() and produces values lazily.)
The same applies to your function3; you read the whole file into a dictionary first, so all of it is still held in memory as you iterate. There is no need to read the whole file into memory just to yield each pair afterwards. You could just loop over the reader and yield rows:
def function3(file):
    seen = set()
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter=' ')
        for k, v in reader:
            if k in seen:
                # already seen
                continue
            seen.add(k)
            yield k, v
This only stores the keys seen so far to avoid duplicates (like the dictionary would), but the values are not stored. Memory use grows as you iterate over the generator. If duplicates are not an issue, you can omit tracking seen keys altogether:
def function3(file):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter=' ')
        for k, v in reader:
            yield k, v
or even, since the reader is iterable after all, hand it out directly. But beware: returning it from inside the with block would close the file before anything is read (the same pitfall as in the first question above), so the caller must own the file:

def function3(f):
    # f is an already-open file object; the caller opens and closes it
    return csv.reader(f, delimiter=' ')
The generator object contains a reference to the function's scope and by extension all local objects within it. The way to reduce memory usage is to use iterators at every level possible, not just at the top level.
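For instance, chaining generators keeps memory flat at every stage. A sketch (the file name and the even-key filter are arbitrary illustrations):

import csv

def read_pairs(path):
    # lazily yield (key, value) pairs straight from the file
    with open(path) as f:
        for k, v in csv.reader(f, delimiter=' '):
            yield k, v

def only_even_keys(pairs):
    # lazily filter an upstream iterator; nothing is materialized
    for k, v in pairs:
        if int(k) % 2 == 0:
            yield k, v

for k, v in only_even_keys(read_pairs('data.csv')):
    print(k, v)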
If you want to check how much memory an object uses, this post can serve as a proxy; I found it helpful:
"Try this:
sys.getsizeof(object)
getsizeof() calls the object's __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector."
For containers, getsizeof() only measures the outer object; a recursive recipe that follows references is needed to include everything a container holds.
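As a quick illustration of the question's point, compare a materialized list with a generator (exact sizes vary by platform, so the numbers in the comments are only indicative):

import sys

big_list = [i * 2 for i in range(100000)]   # all elements exist up front
lazy_gen = (i * 2 for i in range(100000))   # elements produced on demand

print(sys.getsizeof(big_list))  # hundreds of kilobytes
print(sys.getsizeof(lazy_gen))  # on the order of 100 bytes, independent of the range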
I need to loop until I hit the end of a file-like object, but I'm not finding an "obvious way to do it", which makes me suspect I'm overlooking something, well, obvious. :-)
I have a stream (in this case, it's a StringIO object, but I'm curious about the general case as well) which stores an unknown number of records in "<length><data>" format, e.g.:
data = StringIO("\x07\x00\x00\x00foobar\x00\x04\x00\x00\x00baz\x00")
Now, the only clear way I can imagine to read this is using (what I think of as) an initialized loop, which seems a little un-Pythonic:
len_name = data.read(4)
while len_name != "":
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))
    len_name = data.read(4)
In a C-like language, I'd just stick the read(4) in the while's test clause, but of course that won't work for Python. Any thoughts on a better way to accomplish this?
You can combine iteration through iter() with a sentinel:
for block in iter(lambda: file_obj.read(4), ""):
    use(block)
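Applied to the question's length-prefixed records (data and names as defined there), the sentinel version might look like:

import struct

names = []
for raw_len in iter(lambda: data.read(4), ""):
    # unpack the 4-byte little-endian length, then read that many bytes
    (length,) = struct.unpack("<I", raw_len)
    names.append(data.read(length))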
Have you seen how to iterate over lines in a text file?
for line in file_obj:
    use(line)
You can do the same thing with your own generator:
def read_blocks(file_obj, size):
    while True:
        data = file_obj.read(size)
        if not data:
            break
        yield data

for block in read_blocks(file_obj, 4):
    use(block)
See also:
file.read
I prefer the already-mentioned iterator-based solution to turn this into a for loop. Another solution, written directly, is Knuth's "loop-and-a-half":
while 1:
    len_name = data.read(4)
    if not len_name:
        break
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))
You can see by comparison how that's easily hoisted into its own generator and used as a for-loop.
I see, as predicted, that the typical and most popular answers use very specialized generators to "read 4 bytes at a time". Sometimes generality isn't any harder (and much more rewarding;-), so I've suggested instead the following very general solution:
import operator

def funlooper(afun, *a, **k):
    wearedone = k.pop('wearedone', operator.not_)
    while True:
        data = afun(*a, **k)
        if wearedone(data): break
        yield data
Now your desired loop header is just: for len_name in funlooper(data.read, 4):.
Edit: made much more general by the wearedone idiom since a comment accused my slightly less general previous version (hardcoding the exit test as if not data:) of having "a hidden dependency", of all things!-)
The usual Swiss Army knife of looping, itertools, is fine too, of course, as usual:
import itertools as it
for len_name in it.takewhile(bool, it.imap(data.read, it.repeat(4))): ...
or, quite equivalently:
import itertools as it

def loop(pred, fun, *args):
    return it.takewhile(pred, it.starmap(fun, it.repeat(args)))

for len_name in loop(bool, data.read, 4): ...
The EOF marker in Python is an empty string, so what you have is pretty close to the best you are going to get without writing a function to wrap this up in an iterator. It could be written in a slightly more Pythonic way by changing the while like so:
while len_name:
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))
    len_name = data.read(4)
I'd go with Tendayi's suggestion of a function and iterator for readability:
def read4():
    len_name = data.read(4)
    if len_name:
        len_name = struct.unpack("<I", len_name)[0]
        return data.read(len_name)
    else:
        raise StopIteration

for d in iter(read4, ''):
    names.append(d)
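A small variant: since iter() is given '' as the sentinel, read4 can simply return the sentinel instead of raising StopIteration by hand, which is the intended use of the two-argument form of iter():

def read4():
    len_name = data.read(4)
    if not len_name:
        return ''  # returning the sentinel ends iter(read4, '')
    length = struct.unpack("<I", len_name)[0]
    return data.read(length)

for d in iter(read4, ''):
    names.append(d)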