Everyone knows pickle is not a secure way to store user data. It even says so on the box.
I'm looking for examples of strings or data structures that break pickle parsing in the currently supported versions of CPython >= 2.4. Are there things that can be pickled but not unpickled? Are there problems with particular unicode characters? Really big data structures? Obviously the old ASCII protocol has some issues, but what about the most current binary form?
I'm particularly curious about ways in which the pickle loads operation can fail, especially when given a string produced by pickle itself. Are there any circumstances in which pickle will continue parsing past the '.' (STOP) opcode?
What sort of edge cases are there?
Edit: Here are some examples of the sort of thing I'm looking for:
In Python 2.4, you can pickle an array without error, but you can't unpickle it. http://bugs.python.org/issue1281383
You can't reliably pickle objects that inherit from dict and call __setitem__ before instance variables are set with __setstate__. This can be a gotcha when pickling Cookie objects. See http://bugs.python.org/issue964868 and http://bugs.python.org/issue826897
Python 2.4 (and 2.5?) will happily produce a pickle for infinity (or for values that overflow to it, like 1e100000), but may (depending on platform) fail when loading it back. See http://bugs.python.org/issue880990 and http://bugs.python.org/issue445484
This last item is interesting because it reveals a case where the STOP marker does not actually stop parsing - when the marker exists as part of a literal, or more generally, when not preceded by a newline.
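To make the STOP-opcode point concrete, here is a minimal sketch (Python 2 / cPickle and protocol 0 to match the question; the string is just an example) showing that a '.' inside a pickled string literal is not treated as STOP; only the trailing '.' ends parsing:

# Minimal sketch (assumes Python 2 / cPickle, protocol 0).
import cPickle as pickle

data = pickle.dumps("contains . a dot", 0)
print repr(data)           # the '.' sits inside the S'...' literal
print pickle.loads(data)   # parsing continues past the embedded '.'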
This is a greatly simplified example of what pickle didn't like about my data structure.
import cPickle as pickle
class Member(object):
    def __init__(self, key):
        self.key = key
        self.pool = None

    def __hash__(self):
        return self.key

class Pool(object):
    def __init__(self):
        self.members = set()

    def add_member(self, member):
        self.members.add(member)
        member.pool = self

member = Member(1)
pool = Pool()
pool.add_member(member)

with open("test.pkl", "w") as f:
    pickle.dump(member, f, pickle.HIGHEST_PROTOCOL)

with open("test.pkl", "r") as f:
    x = pickle.load(f)
Pickle is known to be a little funny with circular structures, but if you toss custom hash functions and sets/dicts into the mix then things get quite hairy.
In this particular example it partially unpickles the member and then encounters the pool. So it then partially unpickles the pool and encounters the members set. So it creates the set and tries to add the partially unpickled member to the set. At which point it dies in the custom hash function, because the member is only partially unpickled. I dread to think what might happen if you had an "if hasattr..." in the hash function.
$ python --version
Python 2.6.5
$ python test.py
Traceback (most recent call last):
File "test.py", line 25, in <module>
x = pickle.load(f)
File "test.py", line 8, in __hash__
return self.key
AttributeError: ("'Member' object has no attribute 'key'", <type 'set'>, ([<__main__.Member object at 0xb76cdaac>],))
If you are interested in how things fail with pickle (or cPickle, as it's just a slightly different import), you can use this growing list of all the different object types in Python to test against fairly easily:
https://github.com/uqfoundation/dill/blob/master/dill/_objects.py
The package dill includes functions that discover how an object fails to pickle, for example by catching the error it throws and returning it to the user.
dill.dill has these functions, which you could also build for pickle or cPickle, simply with a cut-and-paste and an import pickle or import cPickle as pickle (or import dill as pickle):
def copy(obj, *args, **kwds):
    """use pickling to 'copy' an object"""
    return loads(dumps(obj, *args, **kwds))

# quick sanity checking
def pickles(obj, exact=False, safe=False, **kwds):
    """quick check if object pickles with dill"""
    if safe: exceptions = (Exception,)  # RuntimeError, ValueError
    else:
        exceptions = (TypeError, AssertionError, PicklingError, UnpicklingError)
    try:
        pik = copy(obj, **kwds)
        try:
            result = bool(pik.all() == obj.all())
        except AttributeError:
            result = pik == obj
        if result: return True
        if not exact:
            return type(pik) == type(obj)
        return False
    except exceptions:
        return False
and includes these in dill.detect:
def baditems(obj, exact=False, safe=False):  #XXX: obj=globals() ?
    """get items in object that fail to pickle"""
    if not hasattr(obj, '__iter__'):  # is not iterable
        return [j for j in (badobjects(obj, 0, exact, safe),) if j is not None]
    obj = obj.values() if getattr(obj, 'values', None) else obj
    _obj = []  # can't use a set, as items may be unhashable
    [_obj.append(badobjects(i, 0, exact, safe)) for i in obj if i not in _obj]
    return [j for j in _obj if j is not None]

def badobjects(obj, depth=0, exact=False, safe=False):
    """get objects that fail to pickle"""
    if not depth:
        if pickles(obj, exact, safe): return None
        return obj
    return dict(((attr, badobjects(getattr(obj, attr), depth-1, exact, safe)) \
                 for attr in dir(obj) if not pickles(getattr(obj, attr), exact, safe)))

def badtypes(obj, depth=0, exact=False, safe=False):
    """get types for objects that fail to pickle"""
    if not depth:
        if pickles(obj, exact, safe): return None
        return type(obj)
    return dict(((attr, badtypes(getattr(obj, attr), depth-1, exact, safe)) \
                 for attr in dir(obj) if not pickles(getattr(obj, attr), exact, safe)))
and this last function, which is what you can use to test the objects in dill._objects
def errors(obj, depth=0, exact=False, safe=False):
    """get errors for objects that fail to pickle"""
    if not depth:
        try:
            pik = copy(obj)
            if exact:
                assert pik == obj, \
                    "Unpickling produces %s instead of %s" % (pik, obj)
            assert type(pik) == type(obj), \
                "Unpickling produces %s instead of %s" % (type(pik), type(obj))
            return None
        except Exception:
            import sys
            return sys.exc_info()[1]
    return dict(((attr, errors(getattr(obj, attr), depth-1, exact, safe)) \
                 for attr in dir(obj) if not pickles(getattr(obj, attr), exact, safe)))
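For example, here is a rough usage sketch of those helpers (the names holder and gen are just placeholders, and a live generator is used as a typically unpicklable member; exact output may vary with your dill version):

# Hedged usage sketch: probing why an object fails to pickle with dill.
import dill
from dill import detect

gen = (i for i in range(3))                 # generators generally don't pickle
holder = {'name': 'demo', 'gen': gen}

print(dill.pickles(holder))                 # likely False, because of the generator
print(detect.baditems(holder))              # the members that fail to pickle
print(detect.errors(gen))                   # the exception raised while pickling it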
It is possible to pickle class instances. If I knew what classes your application uses, then I could subvert them. A contrived example:
import string
import subprocess

class Command(object):
    def __init__(self, command):
        self._command = self._sanitize(command)

    @staticmethod
    def _sanitize(command):
        return filter(lambda c: c in string.letters, command)

    def run(self):
        subprocess.call('/usr/lib/myprog/%s' % self._command, shell=True)
Now if your program creates Command instances and saves them using pickle, and I could subvert or inject into that storage, then I could run any command I choose by setting self._command directly.
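To make that concrete, here is a hedged attacker-side sketch (the payload string is just an illustration): pickle restores an instance's __dict__ without calling __init__, so _sanitize never runs.

# Hypothetical sketch: build a Command whose __init__ (and therefore _sanitize)
# never runs, then pickle it. Unpickling only restores __dict__, so the
# unsanitized value survives.
import pickle

evil = Command.__new__(Command)
evil.__dict__['_command'] = 'prog; cat /etc/passwd'   # placeholder payload
payload = pickle.dumps(evil)

restored = pickle.loads(payload)
restored.run()   # runs '/usr/lib/myprog/prog; cat /etc/passwd' through the shell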
In practice my example should never pass for secure code anyway. But note that if the sanitize function is secure, then so is the entire class, apart from the possible use of pickle from untrusted data breaking this. Therefore, there exist programs which are secure but can be made insecure by the inappropriate use of pickle.
The danger is that your pickle-using code could be subverted along the same principle but in innocent-looking code where the vulnerability is far less obvious. The best thing to do is to always avoid using pickle to load untrusted data.
I've been trying to get some dynamically created types (i.e. ones created by calling 3-arg type()) to pickle and unpickle nicely. I've been using this module switching trick to hide the details from users of the module and give clean semantics.
I've learned several things already:
The type must be findable with getattr on the module itself
The type must be consistent with what getattr finds, that is to say if we call pickle.dumps(o) then it must be true that type(o) == getattr(module, 'name of type')
Where I'm stuck though is that there still seems to be something odd going on - it seems to be calling __getstate__ on something unexpected.
Here's the simplest setup I've got that reproduces the issue, testing with Python 3.5, but I'd like to target back to 3.3 if possible:
# module.py
import sys
import functools

def dump(self):
    return b'Some data'  # Dummy for testing

def undump(self, data):
    print('Undump: %r' % data)  # Do nothing for testing

# Cheaty demo way to make this consistent
@functools.lru_cache(maxsize=None)
def make_type(name):
    return type(name, (), {
        '__getstate__': dump,
        '__setstate__': undump,
    })

class Magic(object):
    def __init__(self, path):
        self.path = path

    def __getattr__(self, name):
        print('Getting thing: %s (from: %s)' % (name, self.path))
        # for simple testing all calls to make_type must end in last x.y.z.last
        if name != 'last':
            if self.path:
                return Magic(self.path + '.' + name)
            else:
                return Magic(name)
        return make_type(self.path + '.' + name)

# Make the switch
sys.modules[__name__] = Magic('')
And then a quick way to exercise that:
import module
import pickle
f=module.foo.bar.woof.last()
print(f.__getstate__()) # See, *this* works
print('Pickle starts here')
print(pickle.dumps(f))
Which then gives:
Getting thing: foo (from: )
Getting thing: bar (from: foo)
Getting thing: woof (from: foo.bar)
Getting thing: last (from: foo.bar.woof)
b'Some data'
Pickle starts here
Getting thing: __spec__ (from: )
Getting thing: _initializing (from: __spec__)
Getting thing: foo (from: )
Getting thing: bar (from: foo)
Getting thing: woof (from: foo.bar)
Getting thing: last (from: foo.bar.woof)
Getting thing: __getstate__ (from: foo.bar.woof)
Traceback (most recent call last):
File "test.py", line 7, in <module>
print(pickle.dumps(f))
TypeError: 'Magic' object is not callable
I wasn't expecting to see anything looking up __getstate__ on module.foo.bar.woof, but even if we force that lookup to fail by adding:
if name == '__getstate__': raise AttributeError()
into our __getattr__ it still fails with:
Traceback (most recent call last):
File "test.py", line 7, in <module>
print(pickle.dumps(f))
_pickle.PicklingError: Can't pickle <class 'module.Magic'>: it's not the same object as module.Magic
What gives? Am I missing something with __spec__? The docs for __spec__ pretty much just stress setting it appropriately, but don't seem to actually explain much.
More importantly, the bigger question is: how am I supposed to go about making types I programmatically generate via a pseudo-module's __getattr__ implementation pickle properly?
(And obviously once I've managed to get pickle.dumps to produce something I expect pickle.loads to call undump with the same thing)
To pickle f, pickle needs to pickle f's class, module.foo.bar.woof.last.
The docs don't claim support for pickling arbitrary classes. They claim the following:
The following types can be pickled:
...
classes that are defined at the top level of a module
module.foo.bar.woof.last isn't defined at the top level of a module, even a pretend module like module. In this not-officially-supported case, the pickle logic ends up trying to pickle module.foo.bar.woof, either here:
elif parent is not module:
    self.save_reduce(getattr, (parent, lastname))
or here
else if (parent != module) {
    PickleState *st = _Pickle_GetGlobalState();
    PyObject *reduce_value = Py_BuildValue("(O(OO))",
                                           st->getattr, parent, lastname);
    status = save_reduce(self, reduce_value, NULL);
module.foo.bar.woof can't be pickled for multiple reasons. It returns a non-callable Magic instance for all unsupported method lookups, like __getstate__, which is where your first error comes from. The module-switching thing prevents finding the Magic class to pickle it, which is where your second error comes from. There are probably more incompatibilities.
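For contrast, a small sketch of the supported case: pickle stores a top-level class only as a reference (module name plus attribute name) and re-imports it at load time, which is exactly the lookup that has to return the very same object. The class name Top here is just an illustration:

# Rough illustration of pickling-by-reference for a supported, top-level class.
import pickle, pickletools

class Top:
    pass

data = pickle.dumps(Top)
pickletools.dis(data)                # only a module/name reference is stored
assert pickle.loads(data) is Top     # the re-import must find the same object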
It seems, and has already been proven, that making the class callable is just drifting off in another wrong direction; thankfully, with a small hack I could find a workaround that lets the class be resolved through its type. Following the context of the error <class 'module.Magic'>: it's not the same object as module.Magic, the pickler's second lookup of the class does not go through the same call and so yields a different object than the one being pickled; this is a common problem when pickling self-instantiating classes, i.e. an object via its class. The solution here is to patch the class with its type, @mock.patch('module.Magic', type(module.Magic)). This is a short answer, but it works for this case.
Main.py
import module
import pickle
import mock

f = module.foo.bar.woof.last
print(f().__getstate__())  # See, *this* works
print('Pickle starts here')

@mock.patch('module.Magic', type(module.Magic))
def pickleit():
    return pickle.dumps(f())

print(pickleit())
Magic class
class Magic(object):
    def __init__(self, value):
        self.path = value

    __class__: lambda x: x

    def __getstate__(self):
        print("Shoot me! i'm at " + self.path)
        return dump(self)

    def __setstate__(self, value):
        print('something will never occur')
        return undump(self, value)

    def __spec__(self):
        print("Wrong side of the planet ")

    def _initializing(self):
        print("Even farther lost ")

    def __getattr__(self, name):
        print('Getting thing: %s (from: %s)' % (name, self.path))
        # for simple testing all calls to make_type must end in last x.y.z.last
        if name != 'last':
            if self.path:
                return Magic(self.path + '.' + name)
            else:
                return Magic(name)
        print('terminal stage')
        return make_type(self.path + '.' + name)
Even if this is more of a glancing hit than a clean solution, I could see the content dumped to my console.
Summary:
There is a variety of functions for which it would be very useful to be able to pass in two kinds of objects: an object that represents a path (usually a string), and an object that represents a stream of some sort (often something derived from IOBase, but not always). How can such a function differentiate between these two kinds of objects so they can be handled appropriately?
Say I have a function intended to write a file from some kind of object file generator method:
spiff = MySpiffy()

def spiffy_file_makerA(spiffy_obj, file):
    file_str = '\n'.join(spiffy_obj.gen_file())
    file.write(file_str)

with open('spiff.out', 'x') as f:
    spiffy_file_makerA(spiff, f)
    ...do other stuff with f...
This works. Yay. But I'd prefer to not have to worry about opening the file first or passing streams around, at least sometimes... so I refactor with the ability to take a file-path-like object instead of a file-like object, and a return statement:
def spiffy_file_makerB(spiffy_obj, file, mode):
    file_str = '\n'.join(spiffy_obj.gen_file())
    file = open(file, mode)
    file.write(file_str)
    return file

with spiffy_file_makerB(spiff, 'file.out', 'x') as f:
    ...do other stuff with f...
But now I get the idea that it would be useful to have a third function that combines the other two versions, depending on whether file is file-like or file-path-like, and returns the destination file-like object f for use in a context manager. So that I can write code like this:
with spiffy_file_makerAB(spiffy_obj, file_path_like, mode='x') as f:
    ...do other stuff with f...
...but also like this:
file_like_obj = get_some_socket_or_stream()
with spiffy_file_makerAB(spiffy_obj, file_like_obj, mode='x'):
    ...do other stuff with file_like_obj...
    # file_like_obj stream closes when context manager exits
    # unless `closefd=False`
Note that this will require something a bit different than the simplified versions provided above.
Try as I might, I haven't been able to find an obvious way to do this, and the ways I have found seem pretty contrived and just a potential for problems later. For example:
def spiffy_file_makerAB(spiffy_obj, file, mode, *, closefd=True):
    try:
        # file-like (use the file descriptor to open)
        result_f = open(file.fileno(), mode, closefd=closefd)
    except TypeError:
        # file-path-like
        result_f = open(file, mode)
    finally:
        file_str = '\n'.join(spiffy_obj.gen_file())
        result_f.write(file_str)
    return result_f
Are there any suggestions for a better way? Am I way off base and need to be handling this completely differently?
For my money (and this is an opinionated answer), checking the object for the attributes and operations you will need is a pythonic way to decide how to treat it, because that is the nature of duck tests/duck typing:
Duck typing is heavily used in Python, with the canonical example being file-like classes (for example, cStringIO allows a Python string to be treated as a file).
Or, from the Python docs' definition of duck-typing:
A programming style which does not look at an object’s type to determine if it has the right interface; instead, the method or attribute is simply called or used (“If it looks like a duck and quacks like a duck, it must be a duck.”) By emphasizing interfaces rather than specific types, well-designed code improves its flexibility by allowing polymorphic substitution. Duck-typing avoids tests using type() or isinstance(). (Note, however, that duck-typing can be complemented with abstract base classes.) Instead, it typically employs hasattr() tests or EAFP programming.
If you feel strongly that just checking the interface for suitability isn't enough, you can reverse the check and test for basestring or str to decide whether the provided object is path-like. The test differs depending on your version of Python.
is_file_like = not isinstance(fp, basestring) # python 2
is_file_like = not isinstance(fp, str) # python 3
In any case, for your context manager, I would go ahead and make a full-blown object like the below in order to wrap the functionality that you were looking for.
class SpiffyContextGuard(object):
    def __init__(self, spiffy_obj, file, mode, closefd=True):
        self.spiffy_obj = spiffy_obj
        is_file_like = all(hasattr(file, attr) for attr in ('seek', 'close', 'read', 'write'))
        self.fp = file if is_file_like else open(file, mode)
        self.closefd = closefd

    def __enter__(self):
        return self.fp

    def __exit__(self, type_, value, traceback):
        generated = '\n'.join(self.spiffy_obj.gen_file())
        self.fp.write(generated)
        if self.closefd:
            self.fp.__exit__()
And then use it like this:
with SpiffyContextGuard(obj, 'hamlet.txt', 'w', True) as f:
    f.write('Oh that this too too sullied flesh\n')

fp = open('hamlet.txt', 'a')
with SpiffyContextGuard(obj, fp, 'a', False) as f:
    f.write('Would melt, thaw, resolve itself into a dew\n')

with SpiffyContextGuard(obj, fp, 'a', True) as f:
    f.write('Or that the everlasting had not fixed his canon\n')
If you wanted to use try/catch semantics to check for type suitability, you could also wrap the file operations you wanted to expose on your context guard:
class SpiffyContextGuard(object):
    def __init__(self, spiffy_obj, file, mode, closefd=True):
        self.spiffy_obj = spiffy_obj
        self.fp = self.file_or_path = file
        self.mode = mode
        self.closefd = closefd

    def seek(self, offset, *args):
        try:
            self.fp.seek(offset, *args)
        except AttributeError:
            self.fp = open(self.file_or_path, self.mode)
            self.fp.seek(offset, *args)

    # define wrappers for write, read, etc., as well

    def __enter__(self):
        return self

    def __exit__(self, type_, value, traceback):
        generated = '\n'.join(self.spiffy_obj.gen_file())
        self.write(generated)
        if self.closefd:
            self.fp.__exit__()
My suggestion is to pass pathlib.Path objects around; you can simply call .write_bytes(...) or .write_text(...) on these objects.
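A minimal sketch of that idea (spiffy_file_maker is just an illustrative name, and the caller is assumed to hand you a pathlib.Path):

# Minimal sketch: the caller passes a pathlib.Path, and Path does the I/O.
from pathlib import Path

def spiffy_file_maker(spiffy_obj, path: Path):
    path.write_text('\n'.join(spiffy_obj.gen_file()))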
Other than that, you'd have to check the type of your file variable (this is how polymorphism can be done in Python):
from io import IOBase

def some_function(file):
    if isinstance(file, IOBase):
        file.write(...)
    else:
        with open(file, 'w') as file_handler:
            file_handler.write(...)
(I hope io.IOBase is the most basic class to check against...) And you would have to catch possible exceptions around all that.
Probably not the answer you're looking for, but from a taste point of view I think it's better to have functions that only do one thing. Reasoning about them is easier this way.
I'd just have two functions: spiffy_file_makerA(spiffy_obj, file), which handles your first case, and a convenience function that wraps spiffy_file_makerA and creates a file for you.
Another approach to this problem, inspired by this talk from Raymond Hettinger at PyCon 2013, would be to keep the two functions separate as suggested by a couple of the other answers, but to bring the functions together into a class with a number of alternative options for outputting the object.
Continuing with the example I started with, it might look something like this:
class SpiffyFile(object):
    def __init__(self, spiffy_obj, file_path = None, *, mode = 'w'):
        self.spiffy = spiffy_obj
        self.file_path = file_path
        self.mode = mode

    def to_str(self):
        return '\n'.join(self.spiffy.gen_file())

    def to_stream(self, fstream):
        fstream.write(self.to_str())

    def __enter__(self):
        try:
            # do not override an existing stream
            self.fstream
        except AttributeError:
            # convert self.file_path to str to allow for pathlib.Path objects
            self.fstream = open(str(self.file_path), mode = self.mode)
        return self

    def __exit__(self, exc_t, exc_v, tb):
        self.fstream.close()
        del self.fstream

    def to_file(self, file_path = None, mode = None):
        if mode is None:
            mode = self.mode
        try:
            fstream = self.fstream
        except AttributeError:
            if file_path is None:
                file_path = self.file_path
            # convert file_path to str to allow for pathlib.Path objects
            with open(str(file_path), mode = mode) as fstream:
                self.to_stream(fstream)
        else:
            if mode != fstream.mode:
                raise IOError('Ambiguous stream output mode: '
                              'provided mode and fstream.mode conflict')
            if file_path is not None:
                raise IOError('Ambiguous output destination: '
                              'a file_path was provided with an already active file stream.')
            self.to_stream(fstream)
Now we have lots of different options for exporting a MySpiffy object by using a SpiffyFile object. We can just write it to a file directly:
from pathlib import Path
spiff = MySpiffy()
p = Path('spiffies')/'new_spiff.txt'
SpiffyFile(spiff, p).to_file()
We can override the path, too:
SpiffyFile(spiff).to_file(p.parent/'other_spiff.text')
But we can also use an existing open stream:
SpiffyFile(spiff).to_stream(my_stream)
Or, if we want to edit the string first we could open a new file stream ourselves and write the edited string to it:
my_heading = 'This is a spiffy object\n\n'
with open(str(p), mode = 'w') as fout:
    spiff_out = SpiffyFile(spiff).to_str()
    fout.write(my_heading + spiff_out)
And finally, we can just use the SpiffyFile object directly as a context manager, writing to as many different locations - or streams - as we like (note that we can pass the pathlib.Path object directly without worrying about string conversion, which is nifty):
with SpiffyFile(spiff, p) as spiff_file:
    spiff_file.to_file()
    spiff_file.to_file(p.parent/'new_spiff.txt')
    print(spiff_file.to_str())
    spiff_file.to_stream(my_open_stream)
This approach is more consistent with the mantra: explicit is better than implicit.
I'd like to embed pylint in a program. The user enters python programs (in Qt, in a QTextEdit, although not relevant) and in the background I call pylint to check the text he enters. Finally, I print the errors in a message box.
There are thus two questions: First, how can I do this without writing the entered text to a temporary file and giving it to pylint? I suppose at some point pylint (or astroid) handles a stream and not a file anymore.
And, more importantly, is it a good idea? Would it cause problems for imports or other stuff? Intuitively I would say no, since it seems to spawn a new process (with epylint), but I'm no Python expert so I'm really not sure. And if I use this to launch pylint, is it okay too?
Edit:
I tried tinkering with pylint's internals, even fought with it, but finally got stuck at some point.
Here is the code so far:
from astroid.builder import AstroidBuilder
from astroid.exceptions import AstroidBuildingException
from logilab.common.interface import implements
from pylint.interfaces import IRawChecker, ITokenChecker, IAstroidChecker
from pylint.lint import PyLinter
from pylint.reporters.text import TextReporter
from pylint.utils import PyLintASTWalker


class Validator():
    def __init__(self):
        self._messagesBuffer = InMemoryMessagesBuffer()
        self._validator = None
        self.initValidator()

    def initValidator(self):
        self._validator = StringPyLinter(reporter=TextReporter(output=self._messagesBuffer))
        self._validator.load_default_plugins()
        self._validator.disable('W0704')
        self._validator.disable('I0020')
        self._validator.disable('I0021')
        self._validator.prepare_import_path([])

    def destroyValidator(self):
        self._validator.cleanup_import_path()

    def check(self, string):
        return self._validator.check(string)


class InMemoryMessagesBuffer():
    def __init__(self):
        self.content = []

    def write(self, st):
        self.content.append(st)

    def messages(self):
        return self.content

    def reset(self):
        self.content = []


class StringPyLinter(PyLinter):
    """Does what PyLinter does but sets checkers once
    and redefines get_astroid to call build_string"""

    def __init__(self, options=(), reporter=None, option_groups=(), pylintrc=None):
        super(StringPyLinter, self).__init__(options, reporter, option_groups, pylintrc)
        self._walker = None
        self._used_checkers = None
        self._tokencheckers = None
        self._rawcheckers = None
        self.initCheckers()

    def __del__(self):
        self.destroyCheckers()

    def initCheckers(self):
        self._walker = PyLintASTWalker(self)
        self._used_checkers = self.prepare_checkers()
        self._tokencheckers = [c for c in self._used_checkers
                               if implements(c, ITokenChecker) and c is not self]
        self._rawcheckers = [c for c in self._used_checkers if implements(c, IRawChecker)]
        # notify global begin
        for checker in self._used_checkers:
            checker.open()
            if implements(checker, IAstroidChecker):
                self._walker.add_checker(checker)

    def destroyCheckers(self):
        self._used_checkers.reverse()
        for checker in self._used_checkers:
            checker.close()

    def check(self, string):
        modname = "in_memory"
        self.set_current_module(modname)
        astroid = self.get_astroid(string, modname)
        self.check_astroid_module(astroid, self._walker, self._rawcheckers, self._tokencheckers)
        self._add_suppression_messages()
        self.set_current_module('')
        self.stats['statement'] = self._walker.nbstatements

    def get_astroid(self, string, modname):
        """return an astroid representation for a module"""
        try:
            return AstroidBuilder().string_build(string, modname)
        except SyntaxError as ex:
            self.add_message('E0001', line=ex.lineno, args=ex.msg)
        except AstroidBuildingException as ex:
            self.add_message('F0010', args=ex)
        except Exception as ex:
            import traceback
            traceback.print_exc()
            self.add_message('F0002', args=(ex.__class__, ex))


if __name__ == '__main__':
    code = """
a = 1
print(a)
"""
    validator = Validator()
    print(validator.check(code))
The traceback is the following:
Traceback (most recent call last):
File "validator.py", line 16, in <module>
main()
File "validator.py", line 13, in main
print(validator.check(code))
File "validator.py", line 30, in check
self._validator.check(string)
File "validator.py", line 79, in check
self.check_astroid_module(astroid, self._walker, self._rawcheckers, self._tokencheckers)
File "c:\Python33\lib\site-packages\pylint\lint.py", line 659, in check_astroid_module
tokens = tokenize_module(astroid)
File "c:\Python33\lib\site-packages\pylint\utils.py", line 103, in tokenize_module
print(module.file_stream)
AttributeError: 'NoneType' object has no attribute 'file_stream'
# And sometimes this is added :
File "c:\Python33\lib\site-packages\astroid\scoped_nodes.py", line 251, in file_stream
return open(self.file, 'rb')
OSError: [Errno 22] Invalid argument: '<?>'
I'll continue digging tomorrow. :)
I got it running.
The first one (NoneType …) is really easy and is a bug in your code:
Encountering an exception makes get_astroid "fail", i.e. it emits one syntax error message and returns None, which then gets passed on to check_astroid_module!
But for the second one… such bullshit in pylint's/logilab's API… Let me explain: Your astroid object here is of type astroid.scoped_nodes.Module.
It’s also created by a factory, AstroidBuilder, which sets astroid.file = '<?>'.
Unfortunately, the Module class has following property:
@property
def file_stream(self):
    if self.file is not None:
        return open(self.file, 'rb')
    return None
And there’s no way to skip that except for subclassing (Which would render us unable to use the magic in AstroidBuilder), so… monkey patching!
We replace the ill-defined property with one that checks an instance for a reference to our code bytes (e.g. astroid._file_bytes) before engaging in above default behavior.
from io import BytesIO

def _monkeypatch_module(module_class):
    if module_class.file_stream.fget.__name__ == 'file_stream_patched':
        return  # only patch if patch isn't already applied
    old_file_stream_fget = module_class.file_stream.fget

    def file_stream_patched(self):
        if hasattr(self, '_file_bytes'):
            return BytesIO(self._file_bytes)
        return old_file_stream_fget(self)

    module_class.file_stream = property(file_stream_patched)
That monkeypatching can be called just before calling check_astroid_module. But one more thing has to be done. See, there’s more implicit behavior: Some checkers expect and use astroid’s file_encoding field. So we now have this code in the middle of check:
astroid = self.get_astroid(string, modname)
if astroid is not None:
    _monkeypatch_module(astroid.__class__)
    astroid._file_bytes = string.encode('utf-8')
    astroid.file_encoding = 'utf-8'
    self.check_astroid_module(astroid, self._walker, self._rawcheckers, self._tokencheckers)
One could say that no amount of linting creates actually good code. Unfortunately pylint unites enormous complexity with a specialization of calling it on files. Really good code has a nice native API and wraps that with a CLI interface. Don't ask me why file_stream exists if, internally, Module gets built from the source code but then forgets it.
PS: I had to change something else in your code: load_default_plugins has to come before some other stuff (maybe prepare_checkers, maybe something else).
PPS: I suggest subclassing BaseReporter and using that instead of your InMemoryMessagesBuffer.
PPPS: this just got pulled (3.2014), and will fix this: https://bitbucket.org/logilab/astroid/pull-request/15/astroidbuilderstring_build-was/diff
4PS: this is now in the official version, so no monkey patching required: astroid.scoped_nodes.Module now has a file_bytes property (without leading underscore).
Working with an unlocatable stream may definitely cause problems in case of relative imports, since the location is then needed to find the actually imported module.
Astroid supports building an AST from a stream, but this is not used/exposed through Pylint, which is a level higher and designed to work with files. So while you may achieve this, it will need a bit of digging into the low-level APIs.
The easiest way is definitely to save the buffer to a file and then use the SO answer about starting pylint programmatically if you wish (totally forgot this other account of mine found in other responses ;). Another option is to write a custom reporter to gain more control.
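If you go the custom-reporter route, a rough outline might look like this (the exact BaseReporter hooks, e.g. handle_message and _display, vary between pylint versions, so treat this as a sketch rather than a drop-in class):

# Rough sketch of a collecting reporter; hook names differ across pylint versions.
from pylint.reporters import BaseReporter

class InMemoryReporter(BaseReporter):
    name = 'in-memory'

    def __init__(self):
        super(InMemoryReporter, self).__init__(output=None)
        self.messages = []

    def handle_message(self, msg):
        self.messages.append(msg)   # keep the structured message objects

    def _display(self, layout):
        pass                        # nothing to render; we only collect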
I'm using @functools.lru_cache in Python 3.3. I would like to save the cache to a file, in order to restore it when the program is restarted. How can I do this?
Edit 1 Possible solution: We need to pickle any sort of callable
Problem pickling __closure__:
_pickle.PicklingError: Can't pickle <class 'cell'>: attribute lookup builtins.cell failed
If I try to restore the function without it, I get:
TypeError: arg 5 (closure) must be tuple
You can't do what you want using lru_cache, since it doesn't provide an API to access the cache, and it might be rewritten in C in future releases. If you really want to save the cache you have to use a different solution that gives you access to the cache.
It's simple enough to write a cache yourself. For example:
from functools import wraps

def cached(func):
    func.cache = {}
    @wraps(func)
    def wrapper(*args):
        try:
            return func.cache[args]
        except KeyError:
            func.cache[args] = result = func(*args)
            return result
    return wrapper
You can then apply it as a decorator:
>>> @cached
... def fibonacci(n):
...     if n < 2:
...         return n
...     return fibonacci(n-1) + fibonacci(n-2)
...
>>> fibonacci(100)
354224848179261915075L
And retrieve the cache:
>>> fibonacci.cache
{(32,): 2178309, (23,): 28657, ... }
You can then pickle/unpickle the cache as you please and load it with:
fibonacci.cache = pickle.load(cache_file_object)
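For example, the save/restore cycle might look like this (the file name is just an example):

# Sketch: persist the cache between runs.
import pickle

with open('fib_cache.pkl', 'wb') as cache_file:
    pickle.dump(fibonacci.cache, cache_file)

# ...and on the next run:
with open('fib_cache.pkl', 'rb') as cache_file:
    fibonacci.cache = pickle.load(cache_file)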
I found a feature request in Python's issue tracker to add dumps/loads to lru_cache, but it wasn't accepted/implemented. Maybe in the future it will be possible to have built-in support for these operations via lru_cache.
You can use a library of mine, mezmorize
import random
from mezmorize import Cache

cache = Cache(CACHE_TYPE='filesystem', CACHE_DIR='cache')

@cache.memoize()
def add(a, b):
    return a + b + random.randrange(0, 1000)

>>> add(2, 5)
727
>>> add(2, 5)
727
Consider using joblib.Memory for persistent caching to the disk.
Since the disk is enormous, there's no need for an LRU caching scheme.
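A minimal sketch of that suggestion ('cachedir' and the function are just examples):

# Minimal sketch of joblib.Memory-based disk caching.
from joblib import Memory

memory = Memory('cachedir', verbose=0)

@memory.cache
def expensive(x):
    return x ** 2

expensive(3)  # computed once; later calls with the same argument read from disk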
You are not supposed to touch anything inside the decorator implementation except for the public API so if you want to change its behavior you probably need to copy its implementation and add necessary functions yourself. Note that the cache is currently stored as a circular doubly linked list so you will need to take care when saving and loading it.
This is something that I wrote that might be helpful: devcache.
It's designed to help you speed up iterations for long-running methods. It's configurable with a config file:
@devcache(group='crm')
def my_method(a, b, c):
    ...

@devcache(group='db')
def another_method(a, b, c):
    ...
The cache can be refreshed or used per method with a yaml config file like:

refresh: false # refresh true will ignore use_cache and refresh all cached data
props:
  1:
    group: crm
    use_cache: false
  2:
    group: db
    use_cache: true

This would refresh the cache for my_method and use the cache for another_method.
It's not going to help you pickle the callable, but it does the caching part, and it would be straightforward to modify the code to add specialized serialization.
If your use-case is to cache the result of computationally intensive functions in your pytest test suites, pytest already has a file-based cache. See the docs for more info.
This being said, I had a few extra requirements:
I wanted to be able to call the cached function directly in the test instead of from a fixture
I wanted to cache complex python objects, not just simple python primitives/containers
I wanted an implementation that could refresh the cache intelligently (or be forced to invalidate only a single key)
Thus I came up with my own wrapper for the pytest cache, which you
can find below. The implementation is fully documented, but if you
need more info let me know and I'll be happy to edit this answer :)
Enjoy:
from base64 import b64encode, b64decode
import hashlib
import inspect
import pickle
from typing import Any, Optional

import pytest

__all__ = ['cached']


@pytest.fixture
def cached(request):
    def _cached(func: callable, *args, _invalidate_cache: bool = False, _refresh_key: Optional[Any] = None, **kwargs):
        """Caches the result of func(*args, **kwargs) cross-testrun.

        Cache invalidation can be performed by passing _invalidate_cache=True or a _refresh_key can
        be passed for improved control on invalidation policy.

        For example, given a function that executes a side effect such as querying a database:

            result = query(sql)

        can be cached as follows:

            refresh_key = query(sql=fast_refresh_sql)
            result = cached(query, sql=slow_or_expensive_sql, _refresh_key=refresh_key)

        or can be directly invalidated if you are doing rapid iteration of your test:

            result = cached(query, sql=sql, _invalidate_cache=True)

        Args:
            func (callable): Callable that will be called
            _invalidate_cache (bool, optional): Whether or not to invalidate_cache. Defaults to False.
            _refresh_key (Optional[Any], optional): Refresh key to provide a programmatic way to invalidate cache. Defaults to None.
            *args: Positional args to pass to func
            **kwargs: Keyword args to pass to func

        Returns:
            _type_: _description_
        """
        # get debug info
        # see https://stackoverflow.com/a/24439444/4442749
        try:
            func_name = getattr(func, '__name__', repr(func))
        except:
            func_name = '<function>'
        try:
            caller = inspect.getframeinfo(inspect.stack()[1][0])
        except:
            func_name = '<file>:<lineno>'

        call_key = _create_call_key(func, None, *args, **kwargs)
        cached_value = request.config.cache.get(call_key, {"refresh_key": None, "value": None})
        value = cached_value["value"]
        current_refresh_key = str(b64encode(pickle.dumps(_refresh_key)), encoding='utf8')
        cached_refresh_key = cached_value.get("refresh_key")
        if (
            _invalidate_cache  # force invalidate
            or cached_refresh_key is None  # first time caching this call
            or current_refresh_key != cached_refresh_key  # refresh_key has changed
        ):
            print("Cache invalidated for '%s' @ %s:%d" % (func_name, caller.filename, caller.lineno))
            result = func(*args, **kwargs)
            value = str(b64encode(pickle.dumps(result)), encoding='utf8')
            request.config.cache.set(
                key=call_key,
                value={
                    "refresh_key": current_refresh_key,
                    "value": value
                }
            )
        else:
            print("Cache hit for '%s' @ %s:%d" % (func_name, caller.filename, caller.lineno))
            result = pickle.loads(b64decode(bytes(value, encoding='utf8')))
        return result
    return _cached


_args_marker = object()
_kwargs_marker = object()


def _create_call_key(func: callable, refresh_key: Any, *args, **kwargs):
    """Produces a hex hash str of the call func(*args, **kwargs)"""
    # producing a key from func + args
    # see https://stackoverflow.com/a/10220908/4442749
    call_key = pickle.dumps(
        (func, refresh_key) +
        (_args_marker, ) +
        tuple(args) +
        (_kwargs_marker,) +
        tuple(sorted(kwargs.items()))
    )
    # create a hex digest of the key for the filename
    m = hashlib.sha256()
    m.update(bytes(call_key))
    return m.digest().hex()
I have a dictionary-like class that I use to store some values as attributes. I recently added some logic (__getattr__) to return None if an attribute doesn't exist. As soon as I did this, pickle crashed, and I wanted some insight into why?
Test Code:
import cPickle

class DictionaryLike(object):
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __iter__(self):
        return iter(self.__dict__)

    def __getitem__(self, key):
        if(self.__dict__.has_key(key)):
            return self.__dict__[key]
        else:
            return None

    ''' This is the culprit...'''
    def __getattr__(self, key):
        print 'Retreiving Value ', key
        return self.__getitem__(key)

class SomeClass(object):
    def __init__(self, kwargs={}):
        self.args = DictionaryLike(**kwargs)

someClass = SomeClass()
content = cPickle.dumps(someClass, -1)
print content
Result:
Retreiving Value __getnewargs__
Traceback (most recent call last):
File <<file>> line 29, in <module>
content = cPickle.dumps(someClass,-1)
TypeError: 'NoneType' object is not callable`
Did I do something stupid? I had read a post that deepcopy() might require that I throw an exception if a key doesn't exist? If this is the case, is there any easy way to achieve what I want without throwing an exception?
End result is that if someone calls
someClass.args.i_dont_exist
I want it to return None.
Implementing __getattr__ is a bit tricky, since it is called for every non-existing attribute. In your case, the pickle module tests your class for the __getnewargs__ special method and receives None, which is obviously not callable.
You might want to alter __getattr__ to call the base implementation for magic names:
def __getattr__(self, key):
    if key.startswith('__') and key.endswith('__'):
        return super(DictionaryLike, self).__getattr__(key)
    return self.__getitem__(key)
I usually pass through all names starting with an underscore, so that I can sidestep the magic for internal symbols.
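That habit would look roughly like this (a sketch of the convention described above, not code from the question):

# Sketch: let any name starting with an underscore fail the normal way, so
# pickle/copy see the AttributeError they expect.
def __getattr__(self, key):
    if key.startswith('_'):
        raise AttributeError(key)
    return self.__getitem__(key)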
You need to raise an AttributeError when an attribute is not present in your class:
def __getattr__(self, key):
    i = self.__getitem__(key)
    if i is None:
        raise AttributeError
    return i
I am going to assume that this behavior is required. From the Python documentation for __getattr__: "Called when an attribute lookup has not found the attribute in the usual places (i.e. it is not an instance attribute nor is it found in the class tree for self). name is the attribute name. This method should return the (computed) attribute value or raise an AttributeError exception."
There is no way to tell pickle etc. that the attribute it's looking for is not found unless you raise the exception. For example, in your error message pickle is looking for a special callable method called __getnewargs__; pickle expects that if AttributeError is not raised, the returned value is callable.
I guess as one potential workaround you could perhaps try defining all of the special methods pickle is looking for as dummy methods?