How to test if a file has been created by pickle?

Is there any way of checking if a file has been created by pickle? I could just catch exceptions thrown by pickle.load but there is no specific "not a pickle file" exception.

Pickle files don't have a header, so there's no standard way of identifying them short of trying to unpickle one and seeing if any exceptions are raised while doing so.
You could define your own enhanced protocol that included some kind of header by subclassing the Pickler() and Unpickler() classes in the pickle module. However, this can't be done with the much faster cPickle module because, in it, they're factory functions, which can't be subclassed [1].
A more flexible approach would be to define your own independent classes that use corresponding Pickler() and Unpickler() instances from either one of these modules in their implementation.
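For illustration, a minimal sketch of that wrapper idea, using plain functions and a made-up MAGIC marker rather than full Pickler/Unpickler subclasses:
import pickle

MAGIC = b'MYPKL1'  # hypothetical marker; use any prefix your application owns

def dump_with_header(obj, fileobj):
    # Write our own header first, then the ordinary pickle stream after it.
    fileobj.write(MAGIC)
    pickle.dump(obj, fileobj)

def load_with_header(fileobj):
    if fileobj.read(len(MAGIC)) != MAGIC:
        raise ValueError('not one of our pickle files')
    return pickle.load(fileobj)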
Update
The last byte of all pickle files should be the pickle.STOP opcode, so while there isn't a header, there is effectively a very minimal trailer which would be a relatively simple thing to check.
Depending on your exact usage, you might be able to get away with supplementing that with something more elaborate (and longer than one byte), since any data past the STOP opcode in a pickled object's representation is ignored [2].
[1] Footnote [2] in the Python 2 documentation.
[2] Documentation for pickle.loads(), which also applies to pickle.load() since it's currently implemented in terms of the former.
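A minimal sketch of the trailer check described in the update above (the function name is made up, and it's a heuristic only, not a guarantee that the data will actually unpickle):
import pickle

def has_pickle_trailer(path):
    # A pickle stream always ends with the STOP opcode (b'.').
    with open(path, 'rb') as f:
        f.seek(0, 2)               # seek to the end of the file
        if f.tell() == 0:          # an empty file cannot be a pickle
            return False
        f.seek(-1, 2)              # back up one byte
        return f.read(1) == pickle.STOP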

There is no sure way other than to try to unpickle it, and catch exceptions.

I was running into this issue and found a fairly decent way of doing it. You can use the built-in pickletools module to deconstruct a pickle file and get the pickle opcodes. With pickle protocol v2 and higher, the first opcode will be PROTO and the last one, as #martineau mentioned, is STOP. The following code displays these two opcodes. Note that the output in this example can be iterated over, but the opcodes cannot be accessed directly, hence the for loop.
import pickletools

with open("file.pickle", "rb") as f:
    data = f.read()

output = pickletools.genops(data)
opcodes = []
for opcode in output:          # each item is an (opcode, arg, pos) tuple
    opcodes.append(opcode[0])

print(opcodes[0].name)   # PROTO for protocol 2 and higher
print(opcodes[-1].name)  # STOP

Why does linecache check for the length of the tuple elements in the cache?

In https://github.com/python/cpython/blob/3.6/Lib/linecache.py there are 3 instances of this check.
In getlines
def getlines(filename, module_globals=None):
    if filename in cache:
        entry = cache[filename]
        if len(entry) != 1:
            return cache[filename][2]
In updatecache
def updatecache(filename, module_globals=None):
    if filename in cache:
        if len(cache[filename]) != 1:
            del cache[filename]
In lazycache
def lazycache(filename, module_globals):
    if filename in cache:
        if len(cache[filename]) == 1:
            return True
        else:
            return False
I am writing my own version of linecache, and to write the tests for it I need to understand the scenario in which the tuple can be of length 1.
There was one scenario in which the statement in getlines got executed: when the file was accessed for the first time and stored in the cache, and then removed before being accessed a second time. But I still cannot figure out why the check is there in the other two functions.
It would be very helpful if someone could help me understand the purpose of using this length check.
Look at the places the code stores values in the cache. It can store two different kinds of values under cache[filename]:
A 4-tuple of size, mod time, list of lines, and name (here and here).
A 1-tuple of a function that returns a list of lines on demand (here).
The 1-tuple is used to set up lazy loading of modules. Normal files can simply be opened and read when the lines are needed. Module source, however, may be available from the module's loader (found via the import system) but not by opening and reading a file—e.g., zipimport modules. So lazycache has to stash the loader's get_source method so that, if it later needs the lines, it can get them.
And this means that, whenever it uses those cache values, it has to check which kind it has stored and do different things. If it needs the lines now and it has a lazy 1-tuple, it has to go load the lines (via updatecache); if it's checking for cache eviction and it finds a lazy tuple that was never evaluated, it drops it; and so on.
Also notice that in updatecache, if it's loaded the file from a lazy 1-tuple, it doesn't have a mod time, which means that in checkcache, it can't check whether the file is stale. But that's fine for module source—even if you change the file, the old version is still the one that's imported and being used.
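For illustration, here is a rough sketch (not the stdlib code itself) of the two entry shapes and how the length check tells them apart:
# Illustrative only: the two entry shapes linecache keeps under cache[filename].
cache = {
    # Eagerly loaded file: a 4-tuple of (size, mtime, lines, fullname)
    'spam.py': (27, 1700000000.0, ['x = 1\n', 'print(x)\n'], 'spam.py'),
    # Lazily registered module source: a 1-tuple holding a no-argument callable
    # that returns the source text on demand (e.g. a loader's get_source)
    'frozen_mod.py': (lambda: 'x = 1\nprint(x)\n',),
}

def get_lines(filename):
    entry = cache.get(filename)
    if entry is None:
        return []
    if len(entry) == 1:                 # lazy case: materialize the lines now
        return entry[0]().splitlines(keepends=True)
    return entry[2]                     # eager case: lines are stored at index 2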
If you were designing this from scratch, rather than hacking on something that's been in the stdlib since the early 1.x dark ages, you'd probably design this very differently. For example, you might use a class, or possibly two classes implementing the same interface.
Also, notice that a huge chunk of the code in linecache is there to deal with special cases related to loading module source that, unless you're trying to build something that reflects on Python code (like traceback does), you don't need any of. (And even if you are doing that, you'd probably want to use the inspect module rather than talking directly to the import system.)
So, linecache may not be the best sample code to base your own line cache on.

receive file object or file path

I know that in static languages it's always better to receive a file object rather than a string representing a path (from a software design standpoint). However, in a dynamic language like python where you can't see the type of the variable, what's the "correct" way to pass a file?
Isn't it problematic passing the file object, since you need to remember to close it afterwards (which you probably won't, since you can't see the type)?
Ideally you would be using the with statement whenever you open a file, so closing will be handled by that.
with open('filepath', 'r') as f:
    myfunc(f)
otherstuff()  # f is now closed
From the documentation:
It is good practice to use the with keyword when dealing with file
objects. This has the advantage that the file is properly closed after
its suite finishes, even if an exception is raised on the way.
Pass file object just like any other type.
def f(myfile):
    myfile.write('asdf')

ofs = open(filepath, 'w')  # ofs is a file object
f(ofs)                     # passes the file object
ofs.close()
You can also have a function that returns the file object.
def f():
    return open(filepath, 'w')  # returns a file object

ofs = f()
ofs.write('something')
ofs.close()
However, in a dynamic language like python where you can't see the
type of the variable, what's the "correct" way to pass a file?
The short answer is - you don't.
In most object oriented languages, there is an object contract which guarantees that if the object has a method quack, it knows how to quack. Some languages are very strict in enforcing this contract (Java, for example) and others not so much.
In the end it comes down to one of Python's principles EAFP:
Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
LBYL = Look Before You Leap
What this means is that if your method is expecting a "file" (and you state this in your documentation), assume you are being passed a "file-like object". Try to execute a file operation on the object (like read() or close()) and then catch the exception if it's raised.
One of the main points of the EAFP approach is that you may be getting passed an object that works like a file; in other words, the caller knows what they are doing. So if you spend time checking for exact types, you'll have code that isn't working when it should. Now the burden is on the caller to meet your "object contract"; but what if they are not working with files but with an in-memory buffer (which has the same methods as a file)? Or a request object (which, again, has the same file-like methods)? You can't possibly check for all these variations in your code.
This is the preferred approach - instead of the LBYL approach, which would be type checking first.
So, if your method's documentation states that it's expecting a file object, it should work with any object that is "file-like", but when someone passes it a string with a file path, your method should raise an appropriate exception.
Also, and more importantly, you should avoid closing the object in your method, because it may not be a "file", as explained earlier. However, if you absolutely must, make sure the documentation for your method states this very clearly.
Here is an example:
def my_method(fobj):
    '''Writes to fobj, which is any file-like object,
    and returns the object.'''
    try:
        fobj.write('The answer is: {}\n'.format(42))
    except (AttributeError, TypeError):
        raise TypeError('Expected file-like object')
    return fobj
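For instance, the same function happily accepts an in-memory buffer (io.StringIO here, just as an illustration), while a bare path string fails the duck-type check:
import io

buf = io.StringIO()
my_method(buf)                 # any file-like object works
print(buf.getvalue())          # -> The answer is: 42

my_method('some/path.txt')     # a plain path string raises TypeError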
You can use file objects in Python. When they are (automatically) garbage collected, the file will be closed for you.
File objects are implemented using C’s stdio package and can be created with the built-in open() function.

best way to implement custom pretty-printers

Customizing pprint.PrettyPrinter
The documentation for the pprint module mentions that the method PrettyPrinter.format is intended to make it possible to customize formatting.
I gather that it's possible to override this method in a subclass, but this doesn't seem to provide a way to have the base class methods apply line wrapping and indentation.
Am I missing something here?
Is there a better way to do this (e.g. another module)?
Alternatives?
I've checked out the pretty module, which looks interesting, but doesn't seem to provide a way to customize formatting of classes from other modules without modifying those modules.
I think what I'm looking for is something that would allow me to provide a mapping of types (or maybe functions that identify types) to routines that process a node. The routines that process a node would take a node and return the string representation of it, along with a list of child nodes. And so on.
Why I’m looking into pretty-printing
My end goal is to compactly print custom-formatted sections of a DocBook-formatted xml.etree.ElementTree.
(I was surprised to not find more Python support for DocBook. Maybe I missed something there.)
I built some basic functionality into a client called xmlearn that uses lxml. For example, to dump a Docbook file, you could:
xmlearn -i docbook_file.xml dump -f docbook -r book
It's pretty half-ass, but it got me the info I was looking for.
xmlearn has other features too, like the ability to build a graph image and do dumps showing the relationships between tags in an XML document. These are pretty much totally unrelated to this question.
You can also perform a dump to an arbitrary depth, or specify an XPath as a set of starting points. The XPath stuff sort of obsoleted the docbook-specific format, so that isn't really well-developed.
This still isn't really an answer for the question. I'm still hoping that there's a readily customizable pretty printer out there somewhere.
My solution was to replace pprint.PrettyPrinter with a simple wrapper that formats any floats it finds before calling the original printer.
from __future__ import division
import pprint

if not hasattr(pprint, 'old_printer'):
    pprint.old_printer = pprint.PrettyPrinter

class MyPrettyPrinter(pprint.old_printer):
    def _format(self, obj, *args, **kwargs):
        if isinstance(obj, float):
            obj = round(obj, 4)
        return pprint.old_printer._format(self, obj, *args, **kwargs)

pprint.PrettyPrinter = MyPrettyPrinter

def pp(obj):
    pprint.pprint(obj)

if __name__ == '__main__':
    x = [1, 2, 4, 6, 457, 3, 8, 3, 4]
    x = [_/17 for _ in x]
    pp(x)
This question may be a duplicate of:
Any way to properly pretty-print ordered dictionaries in Python?
Using pprint.PrettyPrinter
I looked through the source of pprint. It seems to suggest that, in order to enhance pprint(), you'd need to (see the sketch after this list):
subclass PrettyPrinter
override _format()
test for issubclass(),
and (if it's not your class), pass back to _format()
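A rough sketch of that approach (MySpecialType is a hypothetical class you want to render yourself; Python 3 syntax):
import pprint

class MySpecialType:
    pass

class MyPrettyPrinter(pprint.PrettyPrinter):
    def _format(self, obj, stream, indent, allowance, context, level):
        # Handle our own type; hand everything else back to the base class,
        # which keeps pprint's usual wrapping and indentation.
        if isinstance(obj, MySpecialType):
            stream.write('<MySpecialType>')
        else:
            super()._format(obj, stream, indent, allowance, context, level)

MyPrettyPrinter().pprint({'key': MySpecialType(), 'other': list(range(30))})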
Alternative
I think a better approach would be just to have your own pprint(), which defers to pprint.pformat when it doesn't know what's up.
For example:
'''Extending pprint'''

from pprint import pformat

class CrazyClass: pass

def prettyformat(obj):
    if isinstance(obj, CrazyClass):
        return "^CrazyFoSho^"
    else:
        return pformat(obj)

def prettyp(obj):
    print(prettyformat(obj))

# test
prettyp([1]*100)
prettyp(CrazyClass())
The big upside here is that you don't depend on pprint internals. It’s explicit and concise.
The downside is that you’ll have to take care of indentation manually.
If you would like to modify the default pretty printer without subclassing, you can use the internal _dispatch table on the pprint.PrettyPrinter class. You can see examples of how dispatching is added for internal types like dictionaries and lists in the source.
Here is how I added a custom pretty printer for MatchPy's Operation type:
import pprint
import matchpy
def _pprint_operation(self, object, stream, indent, allowance, context, level):
    """
    Modified from pprint dict https://github.com/python/cpython/blob/3.7/Lib/pprint.py#L194
    """
    operands = object.operands
    if not operands:
        stream.write(repr(object))
        return
    cls = object.__class__
    stream.write(cls.__name__ + "(")
    self._format_items(
        operands, stream, indent + len(cls.__name__), allowance + 1, context, level
    )
    stream.write(")")
pprint.PrettyPrinter._dispatch[matchpy.Operation.__repr__] = _pprint_operation
Now if I use pprint.pprint on any object that has the same __repr__ as matchpy.Operation, it will use this method to pretty print it. This works on subclasses as well, as long as they don't override the __repr__, which makes some sense! If you have the same __repr__ you have the same pretty printing behavior.
Here is an example of the pretty printing some MatchPy operations now:
ReshapeVector(Vector(Scalar('1')),
Vector(Index(Vector(Scalar('0')),
If(Scalar('True'),
Scalar("ReshapeVector(Vector(Scalar('2'), Scalar('2')), Iota(Scalar('10')))"),
Scalar("ReshapeVector(Vector(Scalar('2'), Scalar('2')), Ravel(Iota(Scalar('10'))))")))))
Consider using the pretty module:
http://pypi.python.org/pypi/pretty/0.1

Stable python serialization (e.g. no pickle module relocation issues)

I am considering the use of Quantities to define a number together with its unit. This value most likely will have to be stored on the disk. As you are probably aware, pickling has one major issue: if you relocate the module around, unpickling will not be able to resolve the class, and you will not be able to unpickle the information. There are workarounds for this behavior, but they are, indeed, workarounds.
A solution I fantasized for this issue would be to create a string encoding uniquely a given unit. Once you obtain this encoding from the disk, you pass it to a factory method in the Quantities module, which decodes it to a proper unit instance. The advantage is that even if you relocate the module around, everything will still work, as long as you pass the magic string token to the factory method.
Is this a known concept?
Looks like an application of Wheeler's First Principle, "all problems in computer science can be solved by another level of indirection" (the Second Principle adds "but that will usually create another problem";-). Essentially what you need to do is an indirection to identify the type -- entity-within-type will be fine with pickling-like approaches (you can study the sources of pickle.py and copy_reg.py for all the fine details of the latter).
Specifically, I believe that what you want to do is subclass pickle.Pickler and override the save_inst method. Where the current version says:
if self.bin:
    save(cls)
    for arg in args:
        save(arg)
    write(OBJ)
else:
    for arg in args:
        save(arg)
    write(INST + cls.__module__ + '\n' + cls.__name__ + '\n')
you want to write something different than just the class's module and name -- some kind of unique identifier (made up of two strings) for the class, probably held in your own registry or registries; and similarly for the save_global method.
It's even easier for your subclass of Unpickler, because the _instantiate part is already factored out in its own method: you only need to override find_class, which is:
def find_class(self, module, name):
    # Subclasses may override this
    __import__(module)
    mod = sys.modules[module]
    klass = getattr(mod, name)
    return klass
it must take two strings and return a class object; you can do that through your registries, again.
Like always when registries are involved, you need to think about how to ensure you register all objects (classes) of interest, etc, etc. One popular strategy here is to leave pickling alone, but ensure that all moves of classes, renames of modules, etc, are recorded somewhere permanent; this way, just the subclassed unpickler can do all the work, and it can most conveniently do it all in the overridden find_class -- bypassing all issues of registration. I gather you consider this a "workaround" but to me it seems just an extremely simple, powerful and convenient implementation of the "one more level of indirection" concept, which avoids the "one more problem" issue;-).
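On modern Python 3, a minimal sketch of that "all the work in the overridden find_class" strategy might look like this (the registry contents and module names are hypothetical):
import io
import pickle

# Hypothetical registry of relocations: old (module, name) -> new (module, name).
_RELOCATED = {
    ('old_package.units', 'Quantity'): ('new_package.quantities', 'Quantity'),
}

class RenamingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Redirect lookups for classes that have moved since they were pickled.
        module, name = _RELOCATED.get((module, name), (module, name))
        return super().find_class(module, name)

def loads(data):
    # Drop-in replacement for pickle.loads that tolerates relocated classes.
    return RenamingUnpickler(io.BytesIO(data)).load()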

What is the accepted python alternative to C++ overloaded input stream operators?

In C++, you can do this to easily read data into a class:
istream& operator >> (istream& instream, SomeClass& someclass) {
...
}
In python, the only way I can find to read from the console is the "raw_input" function, which isn't very adaptable to this sort of thing. Is there a pythonic way to go about this?
You are essentially looking for deserialization. Python has a myriad of options for this, depending on the library used. The default is Python pickling. There are lots of other options; you can have a look here.
No, there's no widespread Pythonic convention for "read the next instance of class X from this open input text file". I believe this applies to most languages, including e.g. Java; C++ is kind of the outlier there (and many C++ shops forbid the operator>> use in their local style guides). Serialization (to/from JSON or XML if you need allegedly-human readable text files), suggested by another answer, is one possible approach, but not too hot (no standardized way to serialize completely general class instances to either XML or JSON).
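If you do go the serialization route, one common idiom (an assumption here, not a fixed convention) is an alternate-constructor classmethod that reads from any open file-like object, e.g. with JSON:
import json
import sys

class SomeClass:
    def __init__(self, name, value):
        self.name = name
        self.value = value

    @classmethod
    def from_stream(cls, stream):
        # Read one JSON document from an open text stream and build an instance.
        data = json.load(stream)
        return cls(data['name'], data['value'])

# e.g.  echo '{"name": "spam", "value": 3}' | python prog.py
obj = SomeClass.from_stream(sys.stdin)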
Rather than use raw_input, you can read from sys.stdin (a file-like object):
import sys
input_line = sys.stdin.readline()
# do something with input_line
