Saving an object: replace changes only - python

I have a large object (1.5 GB on disk) that I save via the Python module dill. I perform lengthy operations on the object and want to save its new state once in a while. However, large parts of the object remain unchanged across these operations, and I would like to overwrite only the parts of the file that have changed.
Is there a relatively simple way (e.g. with some existing module) to achieve this?
My intuitive solution would be to save the object attributes one by one and rebuild the object from there. Changes could be detected by comparing each already saved attribute (e.g. via a hash function) with the respective attribute that is about to be saved. Alternatively, I could track which attributes have been changed during an operation.
Is there a package for that? Is there an alternative approach?
I am working with Python 3.7.
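For illustration, the hash-based change detection sketched in the question could look roughly like this (a minimal sketch; names are illustrative, and note that pickled byte streams are not guaranteed to be canonical, so hash equality is a heuristic):
import hashlib
import pickle

def attribute_digests(obj):
    # Digest of each attribute's pickled value.
    return {name: hashlib.sha256(pickle.dumps(value)).hexdigest()
            for name, value in vars(obj).items()}

def changed_attributes(obj, previous_digests):
    # Attributes whose digest differs from the last save.
    current = attribute_digests(obj)
    return {name for name, digest in current.items()
            if previous_digests.get(name) != digest}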

I have implemented a module that does - to a large extent - what I was looking for.
Blocks of files are overwritten only if their content has changed.
Attributes can be saved separately, and this works recursively.
If attributes are not saved separately, it is necessary to pickle the object completely to compare it to an object already present on the file system. However, since writing to the disk is often what makes saving large objects so slow, significant speedups can be gained for large objects. The exact speedup depends on the hardware of the storage medium.
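Whether compare=True actually pays off depends on the setup, so it is worth measuring. Below is a small, hedged timing sketch; it assumes the save_object function defined at the bottom of this answer, and the file name is illustrative:
import time

def time_save(obj, filename, compare):
    # Rough wall-clock timing of a single save; averaging over
    # several runs would be more reliable.
    start = time.perf_counter()
    save_object(obj, filename, compare=compare)
    return time.perf_counter() - start

# After an initial save and a small modification of the object:
# t_diff = time_save(myObject, "myObject.ext", compare=True)
# t_full = time_save(myObject, "myObject.ext", compare=False)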
The code contains a function save_object that saves any object without overwriting existing identical sections. Furthermore, I have implemented a class SeparatelySaveable that can be used as the base class for all objects for which some attributes shall be saved in separate files. Attributes that are instances of SeparatelySaveable will also be saved separately automatically. Further attributes that shall be saved separately can be specified via SeparatelySaveable.set_save_separately.
Attributes that are saved separately are placed in a folder next to the file to which the original object is saved. These attributes will be saved again only if they have been accessed after they have been saved initially. When the object is loaded, the separate attributes will not be loaded until they are accessed.
The code can be found at the bottom of this answer. The usage is as follows. Saving an object without overwriting identical sections works for any object:
save_object(myObject, "fileName.ext")
Saving attributes separately:
# defining classes
class MyClass1(SeparatelySaveable):
    def __init__(self, value):
        super().__init__()
        self.attribute1 = value
        # specify that self.attribute1 shall be
        # saved separately
        self.set_save_separately('attribute1')

class MyClass2(SeparatelySaveable):
    def __init__(self, value):
        super().__init__()
        # attributes that are instances of
        # SeparatelySaveable will always be saved separately
        self.attribute2 = MyClass1(value)

# creating objects
myObject1 = MyClass1(1)
myObject2 = MyClass2(2)

# Saves myObject1 to fileName1.ext and
# myObject1.attribute1 to fileName1.arx/attribute1.ext
myObject1.save_object("fileName1", ".ext", ".arx")

# Saves myObject2 to fileName2.ext and
# myObject2.attribute2 to fileName2.arx/attribute2.ext and
# myObject2.attribute2.attribute1 to fileName2.arx/attribute2.arx/attribute1.ext
myObject2.save_object("fileName2", ".ext", ".arx")

# load myObject2; loadedObject.attribute2 will remain unloaded
loadedObject = load_object("fileName2.ext")

# loadedObject.attribute2 will be loaded; loadedObject.attribute2.attribute1
# will remain unloaded
loadedObject.attribute2

# Saves loadedObject to fileName2.ext and
# loadedObject.attribute2 to fileName2.arx/attribute2.ext;
# loadedObject.attribute2.attribute1 will remain untouched
loadedObject.save_object("fileName2", ".ext", ".arx")
The code:
import dill
import os
import io
from itertools import count
DEFAULT_EXTENSION = ''
"""File name extension used if no extension is specified"""
DEFAULT_FOLDER_EXTENSION = '.arx'
"""Folder name extension used if no extension is specified"""
BLOCKSIZE = 2**20
"""Size of read/write blocks when files are saved"""
def load_object(filename):
"""Load an object.
Parameters
----------
filename : str
Path to the file
"""
with open(filename, 'rb') as file:
return dill.load(file)
def save_object(obj, filename, compare=True):
"""Save an object.
If the object has been saved to the same file earlier, only the parts
that have changed are overwritten. Note that an additional attribute
near the beginning of the pickle stream will 'shift' all subsequent
data, making it necessary to rewrite the entire file.
Parameters
----------
obj : object
Object to be saved
filename : str
Path of the file to which the object shall be saved
compare : bool
Whether only changed parts shall be overwritten. A value of `True` will
be beneficial for large files if no/few changes have been made. A
value of `False` will be faster for small and strongly changed files.
"""
if not compare or not os.path.isfile(filename):
with open(filename, 'wb') as file:
dill.dump(obj, file, byref=True)
return
stream = io.BytesIO()
dill.dump(obj, stream, byref=True)
stream.seek(0)
buf_obj = stream.read(BLOCKSIZE)
with open(filename, 'rb+') as file:
buf_file = file.read(BLOCKSIZE)
        for position in count(0, BLOCKSIZE):
            if not buf_obj:
                file.truncate()
                break
            elif buf_obj != buf_file:
                file.seek(position)
                file.write(buf_obj)
                if not buf_file:
                    file.write(stream.read())
                    break
            buf_file = file.read(BLOCKSIZE)
            buf_obj = stream.read(BLOCKSIZE)
class SeparatelySaveable():
def __init__(self, extension=DEFAULT_EXTENSION,
folderExtension=DEFAULT_FOLDER_EXTENSION):
self.__dumped_attributes = {}
self.__archived_attributes = {}
self.extension = extension
self.folderExtension = folderExtension
self.__saveables = set()
def set_save_separately(self, *name):
self.__saveables.update(name)
def del_save_separately(self, *name):
self.__saveables.difference_update(name)
def __getattr__(self, name):
# prevent infinite recursion if object has not been correctly initialized
if (name == '_SeparatelySaveable__archived_attributes' or
name == '_SeparatelySaveable__dumped_attributes'):
raise AttributeError('SeparatelySaveable object has not been '
'initialized properly.')
if name in self.__archived_attributes:
value = self.__archived_attributes.pop(name)
elif name in self.__dumped_attributes:
value = load_object(self.__dumped_attributes.pop(name))
else:
raise AttributeError("'" + type(self).__name__ + "' object "
"has no attribute '" + name + "'")
setattr(self, name, value)
return value
def __delattr__(self, name):
try:
self.__dumped_attributes.pop(name)
try:
super().__delattr__(name)
except AttributeError:
pass
except KeyError:
super().__delattr__(name)
def hasattr(self, name):
if name in self.__dumped_attributes or name in self.__archived_attributes:
return True
else:
return hasattr(self, name)
def load_all(self):
for name in list(self.__archived_attributes):
getattr(self, name)
for name in list(self.__dumped_attributes):
getattr(self, name)
def save_object(self, fileName, extension=None, folderExtension=None,
overwriteChildExtension=False):
if extension is None:
extension = self.extension
if folderExtension is None:
folderExtension = self.folderExtension
# account for a possible name change - load all components
# if necessary; this could be done smarter
if not (self.__dict__.get('_SeparatelySaveable__fileName',
None) == fileName
and self.__dict__.get('_SeparatelySaveable__extension',
None) == extension
and self.__dict__.get('_SeparatelySaveable__folderExtension',
None) == folderExtension
and self.__dict__.get('_SeparatelySaveable__overwriteChildExtension',
None) == overwriteChildExtension
):
self.__fileName = fileName
self.__extension = extension
self.__folderExtension = folderExtension
self.__overwriteChildExtension = overwriteChildExtension
self.load_all()
# do not save the attributes that had been saved earlier and have not
# been accessed since
archived_attributes_tmp = self.__archived_attributes
self.__archived_attributes = {}
# save the object
dumped_attributes_tmp = {}
saveInFolder = False
for name, obj in self.__dict__.items():
if isinstance(obj, SeparatelySaveable) or name in self.__saveables:
if not saveInFolder:
folderName = fileName+folderExtension
if not os.access(folderName, os.F_OK):
os.makedirs(folderName)
saveInFolder = True
partFileName = os.path.join(folderName, name)
if isinstance(obj, SeparatelySaveable):
if overwriteChildExtension:
savedFileName = obj.save_object(partFileName, extension,
folderExtension,
overwriteChildExtension)
else:
savedFileName = obj.save_object(partFileName)
else:
savedFileName = partFileName+extension
save_object(obj, savedFileName)
dumped_attributes_tmp[name] = obj
self.__dumped_attributes[name] = savedFileName
for name in dumped_attributes_tmp:
self.__dict__.pop(name)
save_object(self, fileName+extension)
archived_attributes_tmp.update(dumped_attributes_tmp)
self.__archived_attributes = archived_attributes_tmp
return fileName+extension

Related

Can dynamically created class methods know their 'created' name at runtime?

I have a class which I want to use to extract data from a text file (already parsed), and I want to do so using dynamically created class methods, because otherwise there would be a lot of repetitive code. Each created class method shall be associated with a specific line of the text file, e.g. '.get_name()' --> read a part of the 0th line of the text file.
My idea was to use a dictionary for the 'to-be-created' method names and corresponding line.
import sys
import inspect
test_file = [['Name=Jon Hancock'],
['Date=16.08.2020'],
['Author=Donald Duck']]
# intended method names
fn_names = {'get_name': 0, 'get_date': 1, 'get_author': 2}
class Filer():
def __init__(self, file):
self.file = file
def __get_line(cls):
name = sys._getframe().f_code.co_name
line = fn_names[name] # <-- causes error because __get_line is not in fn_names
print(sys._getframe().f_code.co_name) # <-- '__get_line'
print(inspect.currentframe().f_code.co_name) # <-- '__get_line'
return print(cls.file[line][0].split('=')[1])
for key, val in fn_names.items():
setattr(Filer, key, __get_line)
f = Filer(test_file)
f.get_author()
f.get_date()
When I try to access the method name to link the method to the designated line in the text file, I get an error because the method name is always '__get_line' instead of e.g. 'get_author' (what I had hoped for).
Another way I thought of to solve this was to make '__get_line' accept an additional argument (line) and set it by passing val during the setattr(), as shown below:
def __get_line(cls, line):
return print(cls.file[line][0].split('=')[1])
and
for key, val in fn_names.items():
setattr(Filer, key, __get_line(val))
However, Python then complains that one argument (line) is missing.
Any ideas how to solve that?
I would propose a much simpler solution, based on some assumptions. Your file appears to consist of key-value pairs. You are choosing to map the line number to a function that processes the right hand side of the line past the = symbol. Python does not conventionally use getters. Attributes are much nicer and easier to use. You can have getter-like functionality by using property objects, but you really don't need that here.
class Filer():
def __init__(self, file):
self.file = file
for line in file:
name, value = line[0].split('=', 1)
setattr(self, name.lower(), value)
That's all you need. Now you can use the result:
>>> f = Filer(test_file)
>>> f.author
'Donald Duck'
If you want to have callable methods exactly like the one you propose for each attribute, I would one-up your proposal and not even have a method to begin with. You can actually generate the methods on the fly in __getattr__:
class Filer():
    def __init__(self, file):
        self.file = file

    def __getattr__(self, name):
        if name in fn_names:
            index = fn_names[name]
            def func(self):
                return self.file[index][0].split('=', 1)[1]
            func.__name__ = func.__qualname__ = name
            return func.__get__(self, type(self))
        raise AttributeError(name)
Calling __get__ is an extra step that makes the function behave as if it were a method of the class all along. It binds the function object to the instance, even though the function is not part of the class.
For example:
>>> f = Filer(test_file)
>>> f.get_author
<bound method get_author of <__main__.Filer object at 0x0000023E7A247748>>
>>> f.get_author()
'Donald Duck'
Consider closing over your keys and values -- note that you can see the below code running at https://ideone.com/qmoZCJ:
import sys
import inspect
test_file = [['Name=Jon Hancock'],
['Date=16.08.2020'],
['Author=Donald Duck']]
# intended method names
fn_names = {'get_name': 0, 'get_date': 1, 'get_author': 2}
class Filer():
def __init__(self, file):
self.file = file
def getter(key, val):
def _get_line(self):
return self.file[val][0].split('=')[1]
return _get_line
for key, val in fn_names.items():
setattr(Filer, key, getter(key, val))
f = Filer(test_file)
print("Author: ", f.get_author())
print("Date: ", f.get_date())

How to clone a class but not copy mutable attributes?

I am trying to write Python code that will track the created instances of a class and save them across sessions. I am trying to do this by creating a list inside the class declaration that keeps track of instances. My code is as follows:
class test_object:
_tracking = []
def __init__(self, text):
self.name = text
test_object._tracking.insert(0, self)
with open("tst.pkl", mode="rb") as f:
try:
pickles = dill.load(f)
except:
pickles = test_object
logger.warning("Found dill to be empty")
f.close()
My issue is handling the case when the pickled data is empty. What I'd like to do in this case is simply use the base class. The issue I'm running into is that test_object._tracking ends up being equal to pickles._tracking. Is there a way to make a copy of test_object so that when test_object._tracking gets updated, pickles._tracking stays the same?
You can do the following:
import dill
class test_object:
_tracking = []
def __init__(self, text):
self.name = text
test_object._tracking.insert(0, self)
test_1 = test_object("abc")
print(test_object._tracking)
# result: [<__main__.test_object object at 0x11a8cda50>]
with open("./my_file.txt", mode="rb") as f:
try:
pickles = dill.load(f)
except:
pickles = type('test_object_copy', test_object.__bases__, dict(test_object.__dict__))
pickles._tracking = []
print("Found dill to be empty")
# The above results in "Found dill to be empty"
print(pickles._tracking)
# prints []
This sets pickles to a copy of the original class. Its _tracking attribute is then empty and distinct from the original class's _tracking.
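The reassignment pickles._tracking = [] is essential: dict(test_object.__dict__) copies attribute references, not the objects they point to, so without it both classes would still share one list. A small sketch illustrating this, using the same cloning pattern as above:
class Original:
    _tracking = []  # mutable class attribute

# Clone the class via a shallow copy of its namespace.
Clone = type('Clone', Original.__bases__, dict(Original.__dict__))
assert Clone._tracking is Original._tracking  # still the same list object

# Rebinding the attribute on the clone breaks the sharing.
Clone._tracking = []
Original._tracking.append('x')
assert Clone._tracking == []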

Python function call within a property of a class

I am trying to use the "setx" function of a Property in a Class to do some processing of date information that I get from excel. I have a few of my own functions that do the data processing which I tested outside the class, and they worked just fine. But when I move them into the class they suddenly become invisible unless I use the self. instance first. When I use the self.My_xldate_as_tuple() method I get an error:
My_xldate_as_tuple() takes 1 positional argument but 2 were given
Even though the code is EXACTLY what i used outside the class before and it worked.
Before moving into the Property Set block, I was doing the processing of date data outside of the class and setting the variables from outside of the class. That gets clunky when I have about 15 different operations that are all based on when the NumDates Property change. I'm showing shortened versions of both the working set of code and the non-working set of code. What is going on with the self. call that changes how the function takes inputs?
Broken Code:
class XLDataClass(object):
_NumDates = []
TupDates = []
def getNumDates(self): return self._NumDates
def setNumDates(self, value):
self._NumDates = value
self.TupDates = list(map(self.My_xldate_as_tuple,value)) #Error here
#This version doesn't work either, since it can't find My_xldate_as_tuple anymore
self.TupDates = list(map(My_xldate_as_tuple,value))
def delNumDates(self):del self._NumDates
NumDates = property(getNumDates,setNumDates,delNumDates,"Ordinal Dates")
#exact copy of the My_xldate_as_tuple function that works outside the class
def My_xldate_as_tuple(Date):
return xlrd.xldate_as_tuple(Date,1)
#Other code and functions here
#end XlDataClass
def GetExcelData(filename,rowNum,titleCol):
csv = np.genfromtxt(filename, delimiter= ",")
NumDates = deque(csv[rowNum,:])
if titleCol == True:
NumDates.popleft()
return NumDates
#Setup
filedir = "C:/Users/blahblahblah"
filename = filedir + "/SamplePandL.csv"
xlData = XLDataClass()
#Put csv data into xlData object
xlData.NumDates= GetExcelData(filename,0,1)
Working Code:
class XLDataClass(object):
NumDates = []
TupDates = []
#Other code and functions here
#end XlDataClass
#exact copy of the same function outside of the class, which works here
def My_xldate_as_tuple(Date):
return xlrd.xldate_as_tuple(Date,1)
def GetExcelData(filename,rowNum,titleCol):
csv = np.genfromtxt(filename, delimiter= ",")
NumDates = deque(csv[rowNum,:])
if titleCol == True:
NumDates.popleft()
return NumDates
#Setup
filedir = "C:/Users/blahblahblah"
filename = filedir + "/SamplePandL.csv"
xlData = XLDataClass()
#Put csv data into xlData object
xlData.NumDates = GetExcelData(filename,0,1)
#same call to the function that was inside the property's setter, but it works here
xlData.TupDates = list(map(My_xldate_as_tuple, xlData.NumDates))
Instance methods in Python require an explicit self in the argument list. Inside the class, you need to write your method definition like:
def My_xldate_as_tuple(self, Date):
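Applied to the class from the question, the setter then works as posted; alternatively, a @staticmethod preserves the original one-argument signature. A minimal sketch (assuming xlrd is installed; the static-method name is illustrative):
import xlrd

class XLDataClass(object):
    def My_xldate_as_tuple(self, Date):
        # Instance method: self is passed implicitly on each call.
        return xlrd.xldate_as_tuple(Date, 1)

    # Alternative: keep the one-argument signature from outside the class.
    @staticmethod
    def My_xldate_as_tuple_static(Date):
        return xlrd.xldate_as_tuple(Date, 1)

    def setNumDates(self, value):
        self._NumDates = value
        self.TupDates = list(map(self.My_xldate_as_tuple, value))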

cPickle.load() in python consumes a large memory

I have a large dictionary whose structure looks like:
dcPaths = {'id_jola_001': CPath instance}
where CPath is a self-defined class:
class CPath(object):
def __init__(self):
# some attributes
self.m_dAvgSpeed = 0.0
...
# a list of CNode instance
self.m_lsNodes = []
where m_lsNodes is a list of CNode:
class CNode(object):
def __init__(self):
# some attributes
self.m_nLoc = 0
# a list of Apps
self.m_lsApps = []
Here, m_lsApps is a list of CApp, which is another self-defined class:
class CApp(object):
def __init__(self):
# some attributes
self.m_nCount= 0
self.m_nUpPackets = 0
I serialize this dictionary by using cPickle:
def serialize2File(strFileName, strOutDir, obj):
if len(obj) != 0:
strOutFilePath = "%s%s" % (strOutDir, strFileName)
with open(strOutFilePath, 'w') as hOutFile:
cPickle.dump(obj, hOutFile, protocol=0)
return strOutFilePath
else:
print("Nothing to serialize!")
It works fine and the size of serialized file is about 6.8GB. However, when I try to deserialize this object:
def deserializeFromFile(strFilePath):
obj = 0
with open(strFilePath) as hFile:
obj = cPickle.load(hFile)
return obj
I find it consumes more than 90GB memory and takes a long time.
why would this happen?
Is there any way I could optimize this?
BTW, I'm using python 2.7.6
You can try specifying the pickle protocol; fastest is -1 (meaning: the latest protocol, no problem if you are pickling and unpickling with the same Python version).
cPickle.dump(obj, file, protocol=-1)
EDIT:
As said in the comments: load detects the protocol itself.
cPickle.load(file)
When you store complex Python objects, Python usually stores a lot of redundant data (look at the object's __dict__ property).
To reduce the memory consumption of deserialized data, you should pickle only native Python types. You can achieve this easily by implementing the methods object.__getstate__() and object.__setstate__(state) on your classes.
See Pickling and unpickling normal class instances in the Python documentation.
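A minimal sketch of that idea for the CApp class from the question (the tuple state layout is an assumption; any native representation works):
class CApp(object):
    def __init__(self):
        self.m_nCount = 0
        self.m_nUpPackets = 0

    def __getstate__(self):
        # Pickle a plain tuple instead of the full __dict__.
        return (self.m_nCount, self.m_nUpPackets)

    def __setstate__(self, state):
        self.m_nCount, self.m_nUpPackets = state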

Python: Pickling a dict with some unpicklable items

I have an object gui_project which has an attribute .namespace, which is a namespace dict. (i.e. a dict from strings to objects.)
(This is used in an IDE-like program to let the user define his own object in a Python shell.)
I want to pickle this gui_project, along with the namespace. Problem is, some objects in the namespace (i.e. values of the .namespace dict) are not picklable objects. For example, some of them refer to wxPython widgets.
I'd like to filter out the unpicklable objects, that is, exclude them from the pickled version.
How can I do this?
(One thing I tried is to go one by one on the values and try to pickle them, but some infinite recursion happened, and I need to be safe from that.)
(I do implement a GuiProject.__getstate__ method right now, to get rid of other unpicklable stuff besides namespace.)
I would use the pickler's documented support for persistent object references. Persistent object references are objects that are referenced by the pickle but not stored in the pickle.
http://docs.python.org/library/pickle.html#pickling-and-unpickling-external-objects
ZODB has used this API for years, so it's very stable. When unpickling, you can replace the object references with anything you like. In your case, you would want to replace the object references with markers indicating that the objects could not be pickled.
You could start with something like this (untested):
import cPickle
def persistent_id(obj):
if isinstance(obj, wxObject):
return "filtered:wxObject"
else:
return None
class FilteredObject:
def __init__(self, about):
self.about = about
def __repr__(self):
return 'FilteredObject(%s)' % repr(self.about)
def persistent_load(obj_id):
if obj_id.startswith('filtered:'):
return FilteredObject(obj_id[9:])
else:
raise cPickle.UnpicklingError('Invalid persistent id')
def dump_filtered(obj, file):
p = cPickle.Pickler(file)
p.persistent_id = persistent_id
p.dump(obj)
def load_filtered(file):
u = cPickle.Unpickler(file)
u.persistent_load = persistent_load
return u.load()
Then just call dump_filtered() and load_filtered() instead of pickle.dump() and pickle.load(). wxPython objects will be pickled as persistent IDs, to be replaced with FilteredObjects at unpickling time.
You could make the solution more generic by filtering out objects that are not of the built-in types and have no __getstate__ method.
Update (15 Nov 2010): Here is a way to achieve the same thing with wrapper classes. Using wrapper classes instead of subclasses, it's possible to stay within the documented API.
from cPickle import Pickler, Unpickler, UnpicklingError
class FilteredObject:
def __init__(self, about):
self.about = about
def __repr__(self):
return 'FilteredObject(%s)' % repr(self.about)
class MyPickler(object):
def __init__(self, file, protocol=0):
pickler = Pickler(file, protocol)
pickler.persistent_id = self.persistent_id
self.dump = pickler.dump
self.clear_memo = pickler.clear_memo
def persistent_id(self, obj):
if not hasattr(obj, '__getstate__') and not isinstance(obj,
(basestring, int, long, float, tuple, list, set, dict)):
return "filtered:%s" % type(obj)
else:
return None
class MyUnpickler(object):
def __init__(self, file):
unpickler = Unpickler(file)
unpickler.persistent_load = self.persistent_load
self.load = unpickler.load
self.noload = unpickler.noload
def persistent_load(self, obj_id):
if obj_id.startswith('filtered:'):
return FilteredObject(obj_id[9:])
else:
raise UnpicklingError('Invalid persistent id')
if __name__ == '__main__':
from cStringIO import StringIO
class UnpickleableThing(object):
pass
f = StringIO()
p = MyPickler(f)
p.dump({'a': 1, 'b': UnpickleableThing()})
f.seek(0)
u = MyUnpickler(f)
obj = u.load()
print obj
assert obj['a'] == 1
assert isinstance(obj['b'], FilteredObject)
assert obj['b'].about
This is how I would do this (I did something similar before and it worked):
1. Write a function that determines whether or not an object is pickleable (see the sketch after this list)
2. Make a list of all the pickleable variables, based on the above function
3. Make a new dictionary (called D) that stores all the non-pickleable variables
4. For each variable in D (this only works if you have very similar variables in D): make a list of strings, where each string is legal Python code, such that when all these strings are executed in order, you get the desired variable
Now, when you unpickle, you get back all the variables that were originally pickleable. For all variables that were not pickleable, you now have a list of strings (legal Python code) that, when executed in order, gives you the desired variable.
Hope this helps
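For the first step, a small hedged helper that tests pickleability simply by attempting a dump:
import pickle

def is_pickleable(obj):
    # True if obj can be pickled; failures surface as TypeError,
    # AttributeError or pickle.PicklingError, so catch broadly.
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False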
I ended up coding my own solution to this, using Shane Hathaway's approach.
Here's the code. (Look for CutePickler and CuteUnpickler.) Here are the tests. It's part of GarlicSim, so you can use it by installing garlicsim and doing from garlicsim.general_misc import pickle_tools.
If you want to use it on Python 3 code, use the Python 3 fork of garlicsim.
One approach would be to inherit from pickle.Pickler, and override the save_dict() method. Copy it from the base class, which reads like this:
def save_dict(self, obj):
write = self.write
if self.bin:
write(EMPTY_DICT)
else: # proto 0 -- can't use EMPTY_DICT
write(MARK + DICT)
self.memoize(obj)
self._batch_setitems(obj.iteritems())
However, in the _batch_setitems, pass an iterator that filters out all items that you don't want to be dumped, e.g.:
def save_dict(self, obj):
write = self.write
if self.bin:
write(EMPTY_DICT)
else: # proto 0 -- can't use EMPTY_DICT
write(MARK + DICT)
self.memoize(obj)
self._batch_setitems(item for item in obj.iteritems()
if not isinstance(item[1], bad_type))
As save_dict isn't an official API, you need to check for each new Python version whether this override is still correct.
The filtering part is indeed tricky. Using simple tricks, you can easily get the pickle to work. However, you might end up filtering out too much and losing information that you could keep when the filter looks a little deeper. And the vast variety of things that can end up in the .namespace makes building a good filter difficult.
However, we could leverage pieces that are already part of Python, such as deepcopy in the copy module.
I made a copy of the stock copy module, and did the following things:
create a new type named LostObject to represent an object that will be lost in pickling.
change _deepcopy_atomic to make sure x is picklable. If it's not, return an instance of LostObject.
objects can define the methods __reduce__ and/or __reduce_ex__ to provide hints about whether and how to pickle them. We make sure these methods will not throw an exception; if they do, that is the hint that the object cannot be pickled.
to avoid making an unnecessary copy of a big object (a la actual deepcopy), we recursively check whether an object is picklable and copy only the unpicklable parts. For instance, for a tuple of a picklable list and an unpicklable object, we will make a copy of the tuple itself (just the container) but not of its member list.
The following is the diff:
[~/Development/scratch/] $ diff -uN /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/copy.py mcopy.py
--- /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/copy.py 2010-01-09 00:18:38.000000000 -0800
+++ mcopy.py 2010-11-10 08:50:26.000000000 -0800
@@ -157,6 +157,13 @@
     cls = type(x)

+    # if x is picklable, there is no need to make a new copy, just ref it
+    try:
+        dumps(x)
+        return x
+    except TypeError:
+        pass
+
     copier = _deepcopy_dispatch.get(cls)
     if copier:
         y = copier(x, memo)
@@ -179,10 +186,18 @@
         reductor = getattr(x, "__reduce_ex__", None)
         if reductor:
             rv = reductor(2)
+            try:
+                x.__reduce_ex__()
+            except TypeError:
+                rv = LostObject, tuple()
         else:
             reductor = getattr(x, "__reduce__", None)
             if reductor:
                 rv = reductor()
+                try:
+                    x.__reduce__()
+                except TypeError:
+                    rv = LostObject, tuple()
             else:
                 raise Error(
                     "un(deep)copyable object of type %s" % cls)
@@ -194,7 +209,12 @@
 _deepcopy_dispatch = d = {}

+from pickle import dumps
+class LostObject(object): pass
 def _deepcopy_atomic(x, memo):
+    try:
+        dumps(x)
+    except TypeError: return LostObject()
     return x

 d[type(None)] = _deepcopy_atomic
 d[type(Ellipsis)] = _deepcopy_atomic
Now back to the pickling part. You simply make a deepcopy using this new deepcopy function and then pickle the copy. The unpicklable parts have been removed during the copying process.
x = dict(a=1)
xx = dict(x=x)
x['xx'] = xx
x['f'] = file('/tmp/1', 'w')
class List():
def __init__(self, *args, **kwargs):
print 'making a copy of a list'
self.data = list(*args, **kwargs)
x['large'] = List(range(1000))
# now x contains a loop and an unpicklable file object
# the following line will throw
from pickle import dumps, loads
try:
dumps(x)
except TypeError:
print 'yes, it throws'
def check_picklable(x):
try:
dumps(x)
except TypeError:
return False
return True
from mcopy import deepcopy, LostObject
# though x has a big List object, this deepcopy will not make a new copy of it
c = deepcopy(x)
dumps(c)
cc = loads(dumps(c))
# check loop reference
if cc['xx']['x'] == cc:
print 'yes, loop reference is preserved'
# check unpickable part
if isinstance(cc['f'], LostObject):
print 'unpicklable part is now an instance of LostObject'
# check large object
if loads(dumps(c))['large'].data[999] == x['large'].data[999]:
print 'large object is ok'
Here is the output:
making a copy of a list
yes, it throws
yes, loop reference is preserved
unpicklable part is now an instance of LostObject
large object is ok
You see that 1) mutual pointers (between x and xx) are preserved and we do not run into an infinite loop; 2) the unpicklable file object is converted to a LostObject instance; and 3) no new copy of the large object is created, since it is picklable.
