I have an application which is working with projects. These projects are currently stored as pickles, generated with
import cPickle
cPickle.dump(project, open(filename, 'wb'), cPickle.HIGHEST_PROTOCOL)
These project files need to be diffable because they are used in a version control environment.
The problem is that if I serialize the exact same object, the pickle turns out different every time. Protocol 0 produces stable output, but I need the files to be smaller (they are around 12 MB with protocol 0).
I found the solution; I'll post it here in case anyone has the same question in the future.
The solution is to do a deepcopy of the object directly before it's pickled.
That way the reference counts, which apparently cause the differences, get reset, and the files turn out the same when using HIGHEST_PROTOCOL.
So instead of
cPickle.dump(instance, open(filename, 'wb'), cPickle.HIGHEST_PROTOCOL)
you need to do this:
from copy import deepcopy

cpy = deepcopy(instance)
cPickle.dump(cpy, open(filename, 'wb'), cPickle.HIGHEST_PROTOCOL)
cpy = None
That way the file size is reduced significantly while the output stays stable enough to diff.
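To sanity-check the fix, you can serialize twice and compare the raw bytes. A minimal sketch (stable_dumps is a hypothetical helper name; project is the object from the question):
import cPickle
from copy import deepcopy

def stable_dumps(obj):
    # deepcopy resets the state that apparently leaks into the pickle,
    # so two dumps of the same object should compare equal
    return cPickle.dumps(deepcopy(obj), cPickle.HIGHEST_PROTOCOL)

assert stable_dumps(project) == stable_dumps(project)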
In my app I'm going to store nodes in relatively small JSON files. I'm looking for a wrapper that creates a Python object from a file (like json.load() does) and then modifies the underlying file every time my app modifies the Python object.
I expect behaviour like:
Wrapper initialization just associates the wrapper with a file path.
node = wrapper(file_path)
Actual reading and parsing of the file happen on the first request.
name = node["name"]
Subsequent read requests do not touch the file system.
date = node["date"]
Each time the app modifies the object, the change is written to disk.
node["name"] = "Jack"
Probably not; the closest that comes to mind is shelve, which uses pickle rather than JSON. dbm is also similar. They each only react to top level changes, so mutable objects can behave surprisingly:
import shelve

shelf = shelve.open("test.shelf")
shelf["alpha"] = ["one", "two"]  # stores alpha = ["one", "two"]
shelf["alpha"].append("three")   # copies alpha out of storage, modifies the
                                 # copy, then throws it away; the stored
                                 # alpha is unchanged
You could certainly write a similar class wrapping a dict with a JSON dump on every write, but it will end up either complicated and costly (making anything that looks mutable to Python be another wrapper instance) or similarly limited, as in the sketch below. The pattern isn't all that unusual; dbm and configparser do similar things. However, I'd advise thinking over what you're storing this way: if your program halts unexpectedly, it may easily end up erasing the file contents (dbm and sqlite are slightly more resistant, but that's a far more complex subject).
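For illustration, here's a minimal sketch of such a wrapper, sharing the top-level-only limitation just described (the class name JsonFile and its file handling are assumptions, not an existing library):
import json
import os

class JsonFile(object):
    def __init__(self, path):
        self.path = path   # initialization only records the path
        self._data = None  # parsed lazily on the first access

    def _load(self):
        if self._data is None:
            if os.path.exists(self.path):
                with open(self.path) as f:
                    self._data = json.load(f)
            else:
                self._data = {}
        return self._data

    def __getitem__(self, key):
        return self._load()[key]  # reads after the first one stay in memory

    def __setitem__(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, "w") as f:  # every write rewrites the file
            json.dump(data, f)

node = JsonFile("node.json")
node["name"] = "Jack"  # written to disk immediately
name = node["name"]    # served from memory
Note that, like shelve, this only notices top-level assignments; node["tags"].append(...) would modify the in-memory copy without rewriting the file.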
Using Python 2.7.
Is there a way to restore only specified objects from a pickle file?
Using the same example as a previous post:
import pickle

# obj0, obj1, obj2 are created here...

# Saving the objects:
with open('objs.pickle', 'wb') as f:
    pickle.dump([obj0, obj1, obj2], f)
Now, I would like to only restore, say, obj1
I am doing the following:
with open('objs.pickle', 'rb') as f:
    obj1 = pickle.load(f)[1]
but let's say I don't know the order of the objects, just the name.
Writing this, I am guessing the names get dropped during pickling?
Instead of storing the objects in a list, you could use a dictionary to provide names for each object:
import pickle
s = pickle.dumps({'obj0': obj0, 'obj1': obj1, 'obj2': obj2})
obj1 = pickle.loads(s)['obj1']
The order of the items no longer matters, in fact there is no order because a dictionary is being restored.
I'm not 100% sure that this is what you wanted. Were you hoping to restore the object of interest only, i.e. without parsing and restoring the other objects? I don't think that can be done with pickles without writing your own parser, or a fair degree of hacking.
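If you really do need to skip the other objects, the "hacking" could take the form of writing each object as its own pickle plus a separate offset index. A sketch only; the two-file layout and file names are illustrative, and obj0, obj1, obj2 are from the question:
import pickle

objs = {'obj0': obj0, 'obj1': obj1, 'obj2': obj2}

# Write each object as an independent pickle, recording where it starts.
offsets = {}
with open('objs.pickle', 'wb') as f:
    for name, obj in objs.items():
        offsets[name] = f.tell()
        pickle.dump(obj, f)
with open('objs.index', 'wb') as f:
    pickle.dump(offsets, f)

# Later: load only obj1, without unpickling the others.
with open('objs.index', 'rb') as f:
    offsets = pickle.load(f)
with open('objs.pickle', 'rb') as f:
    f.seek(offsets['obj1'])
    obj1 = pickle.load(f)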
No. Python objects do not have "names" (apart from some exceptions such as functions and classes, which know their declared names). A name just points to the object; the object does not know its name even at runtime, so the name cannot be persisted in a pickle either.
Perhaps you need a dictionary instead.
Is there any way of checking if a file has been created by pickle? I could just catch exceptions thrown by pickle.load but there is no specific "not a pickle file" exception.
Pickle files don't have a header, so there's no standard way of identifying them short of trying to unpickle one and seeing if any exceptions are raised while doing so.
You could define your own enhanced protocol that included some kind of header by subclassing the Pickler() and Unpickler() classes in the pickle module. However this can't be done with the much faster cPickle module because, in it, they're factory functions, which can't be subclassed [1].
A more flexible approach would be to define your own independent classes that use corresponding Pickler() and Unpickler() instances from either of these modules in their implementation.
Update
The last byte of all pickle files should be the pickle.STOP opcode, so while there isn't a header, there is effectively a very minimal trailer which would be a relatively simple thing to check.
Depending on your exact usage, you might be able to get away with supplementing that with something more elaborate (and longer than one byte), since any data past the STOP opcode in a pickled object's representation is ignored [2].
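A minimal version of that trailer check might look like this (looks_like_pickle is a hypothetical helper name; a matching last byte is only a cheap negative filter, not proof):
import pickle

def looks_like_pickle(path):
    with open(path, 'rb') as f:
        data = f.read()
    # pickle.STOP is the '.' opcode that terminates every pickle stream
    return data.endswith(pickle.STOP)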
[1] Footnote [2] in the Python 2 documentation.
[2] Documentation for pickle.loads(), which also applies to pickle.load() since it's currently implemented in terms of the former.
There is no sure way other than to try to unpickle it, and catch exceptions.
I was running into this issue and found a fairly decent way of doing it. You can use the built-in pickletools module to deconstruct a pickle file and get the pickle opcodes. With pickle protocol v2 and higher, the first opcode will be PROTO, and the last one, as #martineau mentioned, is STOP. The following code displays these two opcodes. Note that the result of genops can be iterated but not indexed directly, hence the for loop.
import pickletools

with open("file.pickle", "rb") as f:
    data = f.read()

opcodes = []
for opcode, arg, pos in pickletools.genops(data):
    opcodes.append(opcode)

print(opcodes[0].name)   # PROTO (for protocol 2 and higher)
print(opcodes[-1].name)  # STOP
I have some data stored in a DB that I want to process. DB access is painfully slow, so I decided to load all the data into a dictionary before any processing. However, due to the huge size of the stored data, I get an out-of-memory error (I see more than 2 GB being used). So I decided to use an on-disk data structure, and found that shelve is an option. Here's what I do (pseudo-Python code):
import os
import shelve

def loadData(name):
    if os.path.exists(name):  # dict already exists on disk
        return shelve.open(name)
    d = shelve.open(name, writeback=True)
    # access the DB and write the data to the dict:
    #     d[key] = value
    # or, for mutable values:
    #     oldValue = d[key]
    #     newValue = f(oldValue)
    #     d[key] = newValue
    d.close()
    return shelve.open(name, writeback=True)
I have a couple of questions:
1) Do I really need writeback=True? What does it do?
2) I still get an out-of-memory exception, since I do not exercise any control over when the data is written to disk. How do I do that? I tried doing a sync() every few iterations, but that didn't help either.
Thanks!
writeback=True forces the shelf to keep in memory every item ever fetched, and to write them all back when the shelf is closed. So it consumes much more memory, and slows down closing.
The advantage of the parameter is that, with it, you don't need the contorted code you show in your comment for mutable items whose mutator is a method -- just
shelf['foobar'].append(23)
works (if shelf was opened with writeback enabled), assuming the item at key 'foobar' is a list of course. Without writeback it would silently be a no-op (leaving the item on disk unchanged); in that case you actually do need to code
thelist = shelf['foobar']
thelist.append(23)
shelf['foobar'] = thelist
in your comment's spirit -- which is stylistically somewhat of a bummer.
However, since you are having memory problems, I definitely recommend not using this dubious writeback option. I think I can call it "dubious" since I was the one proposing and first implementing it, but that was many years ago, and I've mostly repented of doing it -- it generates more confusion (as your Q evidences) than it allows elegance and handiness in moving code originally written to work with dicts (which would use the first idiom, not the second, and thus need rewriting in order to be usable with shelves without writeback). Ah well, sorry, it did seem a good idea at the time.
Using the sqlite3 module is probably your best choice here. You might even be able to use sqlite entirely in memory, since its footprint can be smaller than that of the equivalent Python objects. It's generally a better choice than shelve anyway; shelve uses pickle underneath, which is rarely what you want.
Hell, you could just convert your entire existing database to a sqlite database. sqlite is nice and fast.
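As a rough illustration of that suggestion (the kv table and the put/get helpers are made up for the example; sqlite3 is in the standard library):
import sqlite3

conn = sqlite3.connect("cache.db")  # or ":memory:" to stay entirely in RAM
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

def put(key, value):
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
    conn.commit()  # unlike shelve's writeback, you decide when data hits disk

def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

put("alpha", "one")
print(get("alpha"))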
What's the difference between file and open in Python? When should I use which one? (Say I'm in 2.5)
You should always use open().
As the documentation states:
When opening a file, it's preferable to use open() instead of invoking this constructor directly. file is more suited to type testing (for example, writing "isinstance(f, file)").
Also, file() was removed in Python 3.0.
Two reasons: the Python philosophy of "There ought to be one way to do it", and the fact that file is going away.
file is the actual type (using e.g. file('myfile.txt') is calling its constructor). open is a factory function that will return a file object.
In Python 3.0, file moves from being a built-in to being implemented by multiple classes in the io library (somewhat similar to Java with its buffered readers, etc.)
file() is a type, like an int or a list. open() is a function for opening files, and will return a file object.
This is an example of when you should use open:
f = open(filename, 'r')
for line in f:
    process(line)
f.close()
This is an example of when you should use file:
import sys

class LoggingFile(file):
    def write(self, data):
        sys.stderr.write("Wrote %d bytes\n" % len(data))
        super(LoggingFile, self).write(data)
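For completeness, a quick usage sketch (Python 2 only, since file is gone in 3.0; the path is illustrative):
f = LoggingFile('/tmp/demo.txt', 'w')
f.write('hello')  # also prints "Wrote 5 bytes" to stderr
f.close()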
As you can see, there's a good reason for both to exist, and a clear use-case for both.
Functionally, the two are the same; open will call file anyway, so currently the difference is a matter of style. The Python docs recommend using open.
When opening a file, it's preferable to use open() instead of invoking the file constructor directly.
The reason is that they are not guaranteed to be the same in future versions: open will become a factory function that returns objects of different types depending on how the file is opened.
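That is in fact what happened in Python 3, where the returned type depends on the mode (a quick illustration; the file names are arbitrary):
import io

f = open('data.txt', 'w')   # io.TextIOWrapper
b = open('data.bin', 'wb')  # io.BufferedWriter
print(isinstance(f, io.TextIOWrapper), isinstance(b, io.BufferedWriter))
f.close()
b.close()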
Only ever use open() for opening files. file() is actually being removed in 3.0, and it's deprecated at the moment. They've had a sort of strange relationship, but file() is going now, so there's no need to worry anymore.
The following is from the Python 2.6 docs. [bracket stuff] added by me.
When opening a file, it's preferable to use open() instead of invoking this [file()] constructor directly. file is more suited to type testing (for example, writing isinstance(f, file)).
According to Mr. Van Rossum, although open() is currently an alias for file(), you should use open() because this might change in the future.