Do all iterators cache? How about csv.Reader? - python

We know the following code is loading the data line-by-line only rather than loading them all in memory. i.e. the line alread read will be somehow marked 'deletable' for the OS
def fileGen( file ):
for line in file:
yield line
with open("somefile") as file:
for line in fileGen( file ):
print line
but is there anyway we could verify if this is still true if we modify the definition of fileGen to following?
def fileGen( file ):
for line in csv.Reader( file ):
yield line
How we could know if csv.Reader will cache the data it loaded? thanks
regards,
John

The most reliable way to find out what csv.reader is doing is to read the source. See _csv.c, lines 773 onwards. You'll see that the reader object has a pointer to the underlying iterator (typically a file iterator), and it calls PyIter_Next each time it needs another line. So it does not read ahead or otherwise cache the data it loads.
Another way to find out what csv.reader is doing is to make a mock file object that can report when it is being queried. For example:
class MockFile:
def __init__(self): self.line = 0
def __iter__(self): return self
def next(self):
self.line += 1
print "MockFile line", self.line
return "line,{0}".format(self.line)
>>> r = csv.reader(MockFile())
>>> next(r)
MockFile line 1
['line', '1']
>>> next(r)
MockFile line 2
['line', '2']
This confirms what we learned from reading the csv source code: it only requests the next line from the underlying iterator when its own next method is called.
John made it clear (see comments) that his concern is whether csv.reader keeps the lines alive, preventing them from being collected by Python's memory manager.
Again, you can either read the code (most reliable) or try an experiment. If you look at the implementation of Reader_iternext in _csv.c, you'll see that lineobj is the name given to the object returned by the underlying iterator, and there's a call to Py_DECREF(lineobj) on every path through the code. So csv.reader does not keep lineobj alive.
Here's an experiment to confirm that.
class FinalizableString(string):
"""A string that reports its deletion."""
def __init__(self, s): self.s = s
def __str__(self): return self.s
def __del__(self): print "*** Deleting", self.s
class MockFile:
def __init__(self): self.line = 0
def __iter__(self): return self
def next(self):
self.line += 1
return FinalizableString("line,{0}".format(self.line))
>>> r = csv.reader(MockFile())
>>> next(r)
*** Deleting line,1
['line', '1']
>>> next(r)
*** Deleting line,2
['line', '2']
So you can see that csv.reader does not hang on to the objects it gets from its iterator, and if nothing else is keeping them alive, then they get garbage-collected in a timely fashion.
I have a feeling that there's something more to this question that you're not telling us. Can you explain why you are worried about this?

Related

Reading a text using python iterator __next__ but it is stopping after reading first line

I have a below code
class read_file():
#constructor taking a file name
def __init__(self, filename):
self.__file=open(filename)
self.__current_line=[]
def __iter__(self):
return self
def __next__(self):
try:
if len(self.__current_line)==0:
self.__current_line=self.__file.readline().strip().split()
word=self.__current_line.pop(0)
return word
except:
raise StopIteration
I am reading a multiple line text file. Problem is that this code is reading only first line of the file. No other lines are printed. Any clue how to fix that
[Edit]: Code is stopping when it encounters blank line. It should not stop at blank line. What change is needed for this
Looking for answer in this

How to retrieve all the content of calls made to a mock?

I'm writing a unit test for a function that takes an array of dictionaries and ends up saving it in a CSV. I'm trying to mock it with pytest as usual:
csv_output = (
"Name\tSurname\r\n"
"Eve\tFirst\r\n"
)
with patch("builtins.open", mock_open()) as m:
export_csv_func(array_of_dicts)
assert m.assert_called_once_with('myfile.csv', 'wb') is None
[and here I want to gather all output sent to the mock "m" and assert it against "csv_output"]
I cannot get in any simple way all the data sent to the mock during the open() phase by csv to do the comparison in bulk, instead of line by line. To simplify things, I verified that the following code mimics the operations that export_csv_func() does to the mock:
with patch("builtins.open", mock_open()) as m:
with open("myfile.csv", "wb") as f:
f.write("Name\tSurname\r\n")
f.write("Eve\tFirst\r\n")
When I dig into the mock, I see:
>>> m
<MagicMock name='open' spec='builtin_function_or_method' id='4380173840'>
>>> m.mock_calls
[call('myfile.csv', 'wb'),
call().__enter__(),
call().write('Name\tSurname\r\n'),
call().write('Eve\tFirst\r\n'),
call().__exit__(None, None, None)]
>>> m().write.mock_calls
[call('Name\tSurname\r\n'), call('Eve\tFirst\r\n')]
>>> dir(m().write.mock_calls[0])
['__add__'...(many methods), '_mock_from_kall', '_mock_name', '_mock_parent', 'call_list', 'count', 'index']
I don't see anything in the MagickMock interface where I can gather all the input that the mock has received.
I also tried calling m().write.call_args but it only returns the last call (the last element of the mock_calls attribute, i.e. call('Eve\tFirst\r\n')).
Is there any way of doing what I want?
You can create your own mock.call objects and compare them with what you have in the .call_args_list.
from unittest.mock import patch, mock_open, call
with patch("builtins.open", mock_open()) as m:
with open("myfile.csv", "wb") as f:
f.write("Name\tSurname\r\n")
f.write("Eve\tFirst\r\n")
# Create your array of expected strings
expected_strings = ["Name\tSurname\r\n", "Eve\tFirst\r\n"]
write_calls = m().write.call_args_list
for expected_str in expected_strings:
# assert that a mock.call(expected_str) exists in the write calls
assert call(expected_str) in write_calls
Note that you can use the assert call of your choice. If you're in a unittest.TestCase subclass, prefer to use self.assertIn.
Additionally, if you just want the arg values you can unpack a mock.call object as tuples. Index 0 is the *args. For example:
for write_call in write_calls:
print('args: {}'.format(write_call[0]))
print('kwargs: {}'.format(write_call[1]))
Indeed you can't patch builtins.open.write directly since the patch within a with would need to enter the patched method and see that write is not a class method.
There are a bunch of solutions and the one I would think of first would be to use your own mock. See the example:
class MockOpenWrite:
def __init__(self, *args, **kwargs):
self.res = []
# What's actually mocking the write. Name must match
def write(self, s: str):
self.res.append(s)
# These 2 methods are needed specifically for the use of with.
# If you mock using a decorator, you don't need them anymore.
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
return
mock = MockOpenWrite
with patch("builtins.open", mock):
with open("myfile.csv", "w") as f:
f.write("Name\tSurname\r\n")
f.write("Eve\tFirst\r\n")
print(f.res)
In that case, the res attribute is linked to the instance. So it disappears after the with closes.
You could eventually stored results somewhere else, like a global array, and check the results beyond the end of with.
Feel free to play around with your actual method.
I had to it this way (Python 3.9). It was quite tedious just to get the mock-args out of the function.
from somewhere import my_thing
#patch("lib.function", return_value=MagicMock())
def test_my_thing(my_mock):
my_thing(value1, value2)
(value1_call_args, value2_call_args) = my_mock.call_args_list[0].args

Use decorators to retrieve jsondata if file exists, otherwise run method and then store output as json?

I've read a little bit about decorators without my puny brain understanding them fully, but I believe this is one of the cases where they would be of use.
I have a main method running some other methods:
def run_pipeline():
gene_sequence_fasta_files_made = create_gene_sequence_fasta_files()
....several other methods taking one input argument and having one output argument.
Since each method takes a long time to run, I'd like to store the result in a json object for each method. If the json file exists I load it, otherwise I run the method and store the result. My current solution looks like this:
def run_pipeline():
gene_sequence_fasta_files_made = _load_or_make(create_gene_sequence_fasta_files, "/home/myfolder/ff.json", method_input=None)
...
Problem is, I find this really ugly and hard to read. If it is possible, how would I use decorators to solve this problem?
Ps. sorry for not showing my attempts. I haven't tried anything since I'm working against a deadline for a client and do not have the time (I could deliver the code above; I just find it aesthetically displeasing).
Psps. definition of _load_or_make() appended:
def _load_or_make(method, filename, method_input=None):
try:
with open(filename, 'r') as input_handle:
data = json.load(input_handle)
except IOError:
if method_input == None:
data = method()
else:
data = method(method_input)
with open(filename, 'w+') as output_handle:
json.dump(data, output_handle)
return data
Here's a decorator that tries loading json from the given filename, and if it can't find the file or the json load fails, it runs the original function, writes the result as json to disk, and returns.
def load_or_make(filename):
def decorator(func):
def wraps(*args, **kwargs):
try:
with open(filename, 'r') as f:
return json.load(input_handle)
except Exception:
data = func(*args, **kwargs)
with open(filename, 'w') as out:
json.dump(data, out)
return data
return wraps
return decorator
#load_or_make(filename)
def your_method_with_arg(arg):
# do stuff
return data
#load_or_make(other_filename)
def your_method():
# do stuff
return data
Note that there is an issue with this approach: if the decorated method returns different values depending on the arguments passed to it, the cache won't behave properly. It looks like that isn't a requirement for you, but if it is, you'd need to pick a different filename depending on the arguments passed in (or use pickle-based serialization, and just pickle a dict of args -> results). Here's an example of how to do it using a pickle approach, (very) loosely based on the memoized decorator Christian P. linked to:
import pickle
def load_or_make(filename):
def decorator(func):
def wrapped(*args, **kwargs):
# Make a key for the arguments. Try to make kwargs hashable
# by using the tuple returned from items() instead of the dict
# itself.
key = (args, kwargs.items())
try:
hash(key)
except TypeError:
# Don't try to use cache if there's an
# unhashable argument.
return func(*args, **kwargs)
try:
with open(filename) as f:
cache = pickle.load(f)
except Exception:
cache = {}
if key in cache:
return cache[key]
else:
value = func(*args, **kwargs)
cache[key] = value
with open(filename, "w") as f:
pickle.dump(cache, f)
return value
return wrapped
return decorator
Here, instead of saving the result as json, we pickle the result as a value in a dict, where the key is the arguments provided to the function. Note that you would still need to use a different filename for every function you decorate to ensure you never got incorrect results from the cache.
Do you want to save the results to disk or is in-memory okay? If so, you can use the memoize decorator / pattern, found here: https://wiki.python.org/moin/PythonDecoratorLibrary#Memoize
For each set of unique input arguments, it saves the result from the function in memory. If the function is then called again with the same arguments, it returns the result from memory rather than trying to run the function again.
It can also be altered to allow for a timeout (depending on how long your program runs for) so that if called after a certain time, it should re-run and re-cache the results.
A decorator is simply a callable that takes a function (or a class) as an argument, does something with/to it, and returns something (usually the function in a wrapper, or the class modified or registered):
Since Flat is better than nested I like to use classes if the decorator is at all complex:
class GetData(object):
def __init__(self, filename):
# this is called on the #decorator line
self.filename = filename
self.method_input = input
def __call__(self, func):
# this is called by Python with the completed def
def wrapper(*args, **kwds):
try:
with open(self.filename) as stored:
data = json.load(stored)
except IOError:
data = func(*args, **kwds)
with open(self.filename, 'w+') as stored:
json.dump(data, stored)
return data
return wrapper
and in use:
#GetData('/path/to/some/file')
def create_gene_sequence_fasta_files('this', 'that', these='those'):
pass
#GetData('/path/to/some/other/file')
def create_gene_sequence_fastb_files():
pass
I am no expert in python's decorator.I just learn it from a tutorial.But i think this can help u.But u may not get more readablity from it.
Decorator is a way to give ur different function the similar solution to deal with things,without make ur code mess or losing their readablity.It seems like transparent to the rest of ur code.
def _load_or_make(filename):
def _deco(method):
def __deco():
try:
with open(filename, 'r') as input_handle:
data = json.load(input_handle)
return data
except IOError:
if method_input == None:
data = method()
else:
data = method(method_input)
with open(filename, 'w+') as output_handle:
json.dump(data, output_handle)
return data
return __deco
return _deco
#_load_or_make(filename)
def method(arg):
#things need to be done
pass
return data

Customizing unittest.mock.mock_open for iteration

How should I customize unittest.mock.mock_open to handle this code?
file: impexpdemo.py
def import_register(register_fn):
with open(register_fn) as f:
return [line for line in f]
My first attempt tried read_data.
class TestByteOrderMark1(unittest.TestCase):
REGISTER_FN = 'test_dummy_path'
TEST_TEXT = ['test text 1\n', 'test text 2\n']
def test_byte_order_mark_absent(self):
m = unittest.mock.mock_open(read_data=self.TEST_TEXT)
with unittest.mock.patch('builtins.open', m):
result = impexpdemo.import_register(self.REGISTER_FN)
self.assertEqual(result, self.TEST_TEXT)
This failed, presumably because the code doesn't use read, readline, or readlines.
The documentation for unittest.mock.mock_open says, "read_data is a string for the read(), readline(), and readlines() methods of the file handle to return. Calls to those methods will take data from read_data until it is depleted. The mock of these methods is pretty simplistic. If you need more control over the data that you are feeding to the tested code you will need to customize this mock for yourself. read_data is an empty string by default."
As the documentation gave no hint on what kind of customization would be required I tried return_value and side_effect. Neither worked.
class TestByteOrderMark2(unittest.TestCase):
REGISTER_FN = 'test_dummy_path'
TEST_TEXT = ['test text 1\n', 'test text 2\n']
def test_byte_order_mark_absent(self):
m = unittest.mock.mock_open()
m().side_effect = self.TEST_TEXT
with unittest.mock.patch('builtins.open', m):
result = impexpdemo.import_register(self.REGISTER_FN)
self.assertEqual(result, self.TEST_TEXT)
The mock_open() object does indeed not implement iteration.
If you are not using the file object as a context manager, you could use:
m = unittest.mock.MagicMock(name='open', spec=open)
m.return_value = iter(self.TEST_TEXT)
with unittest.mock.patch('builtins.open', m):
Now open() returns an iterator, something that can be directly iterated over just like a file object can be, and it'll also work with next(). It can not, however, be used as a context manager.
You can combine this with mock_open() then provide a __iter__ and __next__ method on the return value, with the added benefit that mock_open() also adds the prerequisites for use as a context manager:
# Note: read_data must be a string!
m = unittest.mock.mock_open(read_data=''.join(self.TEST_TEXT))
m.return_value.__iter__ = lambda self: self
m.return_value.__next__ = lambda self: next(iter(self.readline, ''))
The return value here is a MagicMock object specced from the file object (Python 2) or the in-memory file objects (Python 3), but only the read, write and __enter__ methods have been stubbed out.
The above doesn't work in Python 2 because a) Python 2 expects next to exist, not __next__ and b) next is not treated as a special method in Mock (rightly so), so even if you renamed __next__ to next in the above example the type of the return value won't have a next method. For most cases it would be enough to make the file object produced an iterable rather than an iterator with:
# Python 2!
m = mock.mock_open(read_data=''.join(self.TEST_TEXT))
m.return_value.__iter__ = lambda self: iter(self.readline, '')
Any code that uses iter(fileobj) will then work (including a for loop).
There is a open issue in the Python tracker that aims to remedy this gap.
As of Python 3.6, the mocked file-like object returned by the unittest.mock_open method doesn't support iteration. This bug was reported in 2014 and it is still open as of 2017.
Thus code like this silently yields zero iterations:
f_open = unittest.mock.mock_open(read_data='foo\nbar\n')
f = f_open('blah')
for line in f:
print(line)
You can work around this limitation via adding a method to the mocked object that returns a proper line iterator:
def mock_open(*args, **kargs):
f_open = unittest.mock.mock_open(*args, **kargs)
f_open.return_value.__iter__ = lambda self : iter(self.readline, '')
return f_open
I found the following solution:
text_file_data = '\n'.join(["a line here", "the second line", "another
line in the file"])
with patch('__builtin__.open', mock_open(read_data=text_file_data),
create=True) as m:
# mock_open doesn't properly handle iterating over the open file with for line in file:
# but if we set the return value like this, it works.
m.return_value.__iter__.return_value = text_file_data.splitlines()
with open('filename', 'rU') as f:
for line in f:
print line

Reading objects from file after dumping then to file

I've made a function to write objects to a file:
def StoreToFile(Thefile,objekt):
utfil=None
utfil=open(Thefile,'wb')
pickle.dump(objekt,utfil)
return True
if utfil is not None:
utfil.close()
And my code to use this function:
for st in Stadion:
StoreToFile(r'C:\pytest\prod.psr',st)
This works like a charm, but how can I put the objects back to a list object?
I have the code to extract the objects, but I'm unable to see how I can iterate through the objects to put them in a new list.
So far I have this:
def ReadFromFile(filename):
infile=None
infile=open(filename,'rb')
objekt=pickle.load(infile)
for st in Stadion:
StoreToFile(r'C:\pytest\prod.psr',st)
This works like a charm.
If you mean "run without errors", then yes, it does "work". This code repeatedly overwrites the file, so it will only contain the last item in the list.
Use this instead:
StoreToFile(r'C:\pytest\prod.psr', Stadion)
Your ReadFromFile() function should work just fine as it is and return a list (assuming above fix).
Also not sure what this does:
return True
if Thefile.close()
Your code is silly the utfil = None business doesn't make sense, because the only way open(...) can fail is with an exception, in which case the rest of the function won't be executed anyway. The right way to do this is with a context manager: the with statement.
Instead, do:
def storeToFile(path, o):
try:
with open(path, "wb") as f:
pickle.dump(o, f)
return True
except pickle.PicklingError, IOError:
return False
You should just pickle the whole list.
To pickle the objects to the same file, use this function:
def storeToFile(fileName, o):
try:
with open(fileName, "a") as file:
cPickle.dump(o, file)
return True
except pickle.PicklingError, IOError:
return False
Note that the file is opened with mode "a", so that new data is appended to the end.
To load the objects again, use this:
def loadEntireFile(fileName):
try:
with open(fileName) as file:
unpickler = cPickle.Unpickler(file)
while True:
yield unpickler.load()
except EOFError:
pass
This function tries to load objects from the file until it encounters EOF, which is indicated by an EOFError. You can use it like this:
foo = [str(x) for x in range(10)]
for x in foo:
storeToFile("test.pickle", x)
foo2 = list(load("test.pickle"))
The list function takes any iterable and builds a list from it. The function loadEntireFile contains a yield statement, making it a generator, so it can be passed to any function taking an iterable.

Categories

Resources