I am trying to write Python code that will track the created instances of a class and save them across sessions. I am trying to do this by creating a list inside the class declaration which keeps track of instances. My code is as follows:
class test_object:
    _tracking = []

    def __init__(self, text):
        self.name = text
        test_object._tracking.insert(0, self)

with open("tst.pkl", mode="rb") as f:
    try:
        pickles = dill.load(f)
    except:
        pickles = test_object
        logger.warning("Found dill to be empty")
    f.close()
My issue is handling the case when the pickled data is empty. What I'd like to do in that case is simply use the base class. The issue I'm running into is that test_object._tracking ends up being equal to pickles._tracking. Is there a way to make a copy of test_object so that when test_object._tracking gets updated, pickles._tracking stays the same?
You can do the following:
import dill

class test_object:
    _tracking = []

    def __init__(self, text):
        self.name = text
        test_object._tracking.insert(0, self)

test_1 = test_object("abc")
print(test_object._tracking)
# result: [<__main__.test_object object at 0x11a8cda50>]

with open("./my_file.txt", mode="rb") as f:
    try:
        pickles = dill.load(f)
    except:
        pickles = type('test_object_copy', test_object.__bases__, dict(test_object.__dict__))
        pickles._tracking = []
        print("Found dill to be empty")
# The above results in "Found dill to be empty"

print(pickles._tracking)
# prints []
This sets pickles to a copy of the original class. Its _tracking attribute is then empty and is distinct from the original class's _tracking.
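To see why a copy is needed at all: in the failing branch of the original code, pickles = test_object merely binds a second name to the same class object, so both names share one _tracking list. A minimal sketch illustrating this (the names here are mine, not from the question):

class test_object:
    _tracking = []

pickles = test_object                 # same class object, not a copy
print(pickles is test_object)         # True
test_object._tracking.insert(0, "new instance")
print(pickles._tracking)              # ['new instance'] - the shared list changed

# a shallow copy of the class via type(); note that dict(test_object.__dict__)
# still references the same list, so _tracking must be rebound explicitly
pickles_copy = type('test_object_copy', test_object.__bases__, dict(test_object.__dict__))
pickles_copy._tracking = []
test_object._tracking.insert(0, "another instance")
print(pickles_copy._tracking)         # [] - unaffected

This is also why the answer resets pickles._tracking = [] after the copy: the copied class dictionary is shallow, so without the reset the copy would still point at the original list.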
In the code below, as you can see, I have a test class that inherits from the SqliteDict class. There is also a get_terms() method that returns the keys of the dictionary. In the main part, I first make an instance of the class, create a new SqliteDict file, and assign simple data to it through a context-manager block. Up to that point everything works great, but when I try to read the data back from the same file through the second context-manager block, it seems the data was not saved in the file.
from collections import defaultdict
from sqlitedict import SqliteDict

class test(SqliteDict):
    def __init__(self, filename: str = "inverted_index.sqlite", new = False):
        super().__init__(filename, flag="n" if new else "c")
        self._index = defaultdict(list) if new else self

    def get_terms(self):
        """Returns all unique terms in the index."""
        return self._index.keys()

if __name__ == "__main__":
    with test("test.sqlite", new=True) as d:
        d._index["test"] = ["ok"]
        print("first attempt: ", [t for t in d.get_terms()])
        d.commit()

    with test("test.sqlite", new=False) as f:
        print("second attempt: ", [t for t in f.get_terms()])
and the result is:
first attempt: ['test']
second attempt: []
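The second read comes back empty because, with new=True, the writes go into the in-memory defaultdict rather than into the SqliteDict itself, so commit() has nothing to persist. A minimal sketch of one possible fix (my own, assuming the index should always be backed by the SqliteDict):

from sqlitedict import SqliteDict

class test(SqliteDict):
    def __init__(self, filename: str = "inverted_index.sqlite", new=False):
        super().__init__(filename, flag="n" if new else "c")
        # always use the SqliteDict itself as the index, so that writes
        # reach the backing sqlite file when commit() is called
        self._index = self

    def get_terms(self):
        """Returns all unique terms in the index."""
        return self._index.keys()

With this change, the original main block persists 'test' and the second context-manager block prints ['test'].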
I have a large object (1.5 GB on the disk) that I save via the python module dill. I perform lengthy operations on the object and want to save the new state of the object once in a while. However, large parts of the object remain unchanged in the operations, and I would like to overwrite the file only where things have changed.
Is there a relatively simple way (e.g. with some existing module) to achieve this task?
My intuitive solution would be to save the object attributes one by one and rebuild the object from there. Changes could be noted after reading an already saved attribute by comparing its value (e.g. via a hash function) with the respective attribute that is to be saved. Alternatively, I could track which attributes have been changed during an operation.
Is there a package for that? Is there an alternative way?
I am working with python 3.7.
I have implemented a module that does - to a large extent - what I was looking for.
Blocks of files are overwritten only if their content has changed.
Attributes can be saved separately, and this works recursively.
If attributes are not saved separately, it is necessary to pickle the object completely to compare it to an object already present on the file system. However, since writing to the disk is often what makes saving large objects so slow, significant speedups can be gained for large objects. The exact speedup depends on the hardware of the storage medium.
The code contains a method save_object that saves any object without overwriting existing identical sections. Furthermore, I have implemented a class SeparatelySaveable that can be used as the base class for all objects for which some attributes shall be saved in separate files. Attributes that are instances of SeparatelySaveable also will be saved separately automatically. Further attributes that shall be saved separately can be specified via SeparatelySaveable.set_save_separately.
Attributes that are saved separately are placed in a folder next to the file to which the original object is saved. These attributes will be saved again only if they have been accessed after they have been saved initially. When the object is loaded, the separate attributes will not be loaded until they are accessed.
The code can be found at the bottom of this answer. The usage is as follows: saving an object without overwriting similar parts works for all objects:
save_object(myObject, "fileName.ext")
Saving attributes separately:
# defining classes
class MyClass1(SeparatelySaveable):
    def __init__(self, value):
        super().__init__()
        self.attribute1 = value
        # specify that self.attribute1 shall be
        # saved separately
        self.set_save_separately('attribute1')

class MyClass2(SeparatelySaveable):
    def __init__(self, value):
        super().__init__()
        # attributes that are instances of
        # SeparatelySaveable will always be saved separately
        self.attribute2 = MyClass1(value)

# creating objects
myObject1 = MyClass1(1)
myObject2 = MyClass2(2)

# Saves myObject1 to fileName1.ext and
# myObject1.attribute1 to fileName1.arx/attribute1.ext
myObject1.save_object("fileName1", ".ext", ".arx")

# Saves myObject2 to fileName2.ext and
# myObject2.attribute2 to fileName2.arx/attribute2.ext and
# myObject2.attribute2.attribute1 to fileName2.arx/attribute2.arx/attribute1.ext
myObject2.save_object("fileName2", ".ext", ".arx")

# load myObject2; loadedObject.attribute2 will remain unloaded
loadedObject = load_object("fileName2.ext")

# accessing loadedObject.attribute2 loads it; loadedObject.attribute2.attribute1
# will remain unloaded
loadedObject.attribute2

# Saves loadedObject to fileName2.ext and
# loadedObject.attribute2 to fileName2.arx/attribute2.ext;
# loadedObject.attribute2.attribute1 will remain untouched
loadedObject.save_object("fileName2", ".ext", ".arx")
The code:
import dill
import os
import io
from itertools import count

DEFAULT_EXTENSION = ''
"""File name extension used if no extension is specified"""

DEFAULT_FOLDER_EXTENSION = '.arx'
"""Folder name extension used if no extension is specified"""

BLOCKSIZE = 2**20
"""Size of read/write blocks when files are saved"""
def load_object(filename):
    """Load an object.

    Parameters
    ----------
    filename : str
        Path to the file

    """
    with open(filename, 'rb') as file:
        return dill.load(file)
def save_object(obj, filename, compare=True):
    """Save an object.

    If the object has been saved at the same file earlier, only the parts
    are overwritten that have changed. Note that an additional attribute
    at the beginning of the file will 'shift' all data, making it
    necessary to rewrite the entire file.

    Parameters
    ----------
    obj : object
        Object to be saved
    filename : str
        Path of the file to which the object shall be saved
    compare : bool
        Whether only changed parts shall be overwritten. A value of `True` will
        be beneficial for large files if no/few changes have been made. A
        value of `False` will be faster for small and strongly changed files.

    """
    if not compare or not os.path.isfile(filename):
        with open(filename, 'wb') as file:
            dill.dump(obj, file, byref=True)
        return

    stream = io.BytesIO()
    dill.dump(obj, stream, byref=True)
    stream.seek(0)
    buf_obj = stream.read(BLOCKSIZE)

    with open(filename, 'rb+') as file:
        buf_file = file.read(BLOCKSIZE)
        for position in count(0, BLOCKSIZE):
            if not len(buf_obj) > 0:
                file.truncate()
                break
            elif not buf_obj == buf_file:
                file.seek(position)
                file.write(buf_obj)
                if not len(buf_file) > 0:
                    file.write(stream.read())
                    break
            buf_file = file.read(BLOCKSIZE)
            buf_obj = stream.read(BLOCKSIZE)
class SeparatelySaveable():
    def __init__(self, extension=DEFAULT_EXTENSION,
                 folderExtension=DEFAULT_FOLDER_EXTENSION):
        self.__dumped_attributes = {}
        self.__archived_attributes = {}
        self.extension = extension
        self.folderExtension = folderExtension
        self.__saveables = set()

    def set_save_separately(self, *name):
        self.__saveables.update(name)

    def del_save_separately(self, *name):
        self.__saveables.difference_update(name)

    def __getattr__(self, name):
        # prevent infinite recursion if object has not been correctly initialized
        if (name == '_SeparatelySaveable__archived_attributes' or
                name == '_SeparatelySaveable__dumped_attributes'):
            raise AttributeError('SeparatelySaveable object has not been '
                                 'initialized properly.')

        if name in self.__archived_attributes:
            value = self.__archived_attributes.pop(name)
        elif name in self.__dumped_attributes:
            value = load_object(self.__dumped_attributes.pop(name))
        else:
            raise AttributeError("'" + type(self).__name__ + "' object "
                                 "has no attribute '" + name + "'")

        setattr(self, name, value)
        return value

    def __delattr__(self, name):
        try:
            self.__dumped_attributes.pop(name)
            try:
                super().__delattr__(name)
            except AttributeError:
                pass
        except KeyError:
            super().__delattr__(name)

    def hasattr(self, name):
        if name in self.__dumped_attributes or name in self.__archived_attributes:
            return True
        else:
            return hasattr(self, name)

    def load_all(self):
        for name in list(self.__archived_attributes):
            getattr(self, name)
        for name in list(self.__dumped_attributes):
            getattr(self, name)

    def save_object(self, fileName, extension=None, folderExtension=None,
                    overwriteChildExtension=False):
        if extension is None:
            extension = self.extension
        if folderExtension is None:
            folderExtension = self.folderExtension

        # account for a possible name change - load all components
        # if necessary; this could be done smarter
        if not (self.__dict__.get('_SeparatelySaveable__fileName',
                                  None) == fileName
                and self.__dict__.get('_SeparatelySaveable__extension',
                                      None) == extension
                and self.__dict__.get('_SeparatelySaveable__folderExtension',
                                      None) == folderExtension
                and self.__dict__.get('_SeparatelySaveable__overwriteChildExtension',
                                      None) == overwriteChildExtension
                ):
            self.__fileName = fileName
            self.__extension = extension
            self.__folderExtension = folderExtension
            self.__overwriteChildExtension = overwriteChildExtension
            self.load_all()

        # do not save the attributes that had been saved earlier and have not
        # been accessed since
        archived_attributes_tmp = self.__archived_attributes
        self.__archived_attributes = {}

        # save the object
        dumped_attributes_tmp = {}
        saveInFolder = False
        for name, obj in self.__dict__.items():
            if isinstance(obj, SeparatelySaveable) or name in self.__saveables:
                if not saveInFolder:
                    folderName = fileName+folderExtension
                    if not os.access(folderName, os.F_OK):
                        os.makedirs(folderName)
                    saveInFolder = True
                partFileName = os.path.join(folderName, name)
                if isinstance(obj, SeparatelySaveable):
                    if overwriteChildExtension:
                        savedFileName = obj.save_object(partFileName, extension,
                                                        folderExtension,
                                                        overwriteChildExtension)
                    else:
                        savedFileName = obj.save_object(partFileName)
                else:
                    savedFileName = partFileName+extension
                    save_object(obj, savedFileName)
                dumped_attributes_tmp[name] = obj
                self.__dumped_attributes[name] = savedFileName

        for name in dumped_attributes_tmp:
            self.__dict__.pop(name)

        save_object(self, fileName+extension)

        archived_attributes_tmp.update(dumped_attributes_tmp)
        self.__archived_attributes = archived_attributes_tmp

        return fileName+extension
I am trying to change my code to a more object-oriented format. In doing so I am lost on how to 'visualize' what is happening with multiprocessing and how to solve it. On the one hand, the class should track changes to local variables across functions; on the other hand, I believe multiprocessing creates a copy of the objects to which the original instance would not have access. I need to figure out a way to manipulate classes, within a class, using multiprocessing, and have the parent class retain all manipulated values in the nested classes.
A simple version (OLD CODE):
def runMultProc():
    ...
    dictReports = {}
    listReports = ['reportName1.txt', 'reportName2.txt']
    tasks = []
    pool = multiprocessing.Pool()
    for report in listReports:
        if report not in dictReports:
            dictReports[today][report] = {}
            tasks.append(pool.apply_async(worker, args=([report, dictReports[today][report]])))
        else:
            continue
    for task in tasks:
        report, currentReportDict = task.get()
        dictReports[report] = currentReportDict

def worker(report, currentReportDict):
    <Manipulate_reports_dict>
    return report, currentReportDict
NEW CODE:
class Transfer():
    def __init__(self):
        self.masterReportDictionary[<todays_date>] = [reportObj1, reportObj2]

    def processReports(self):
        self.pool = multiprocessing.Pool()
        self.pool.map(processWorker, self.masterReportDictionary[<todays_date>])
        self.pool.close()
        self.pool.join()

    def processWorker(self, report):
        # **process and manipulate report, currently no return**
        report.name = 'foo'
        report.path = '/path/to/report'

class Report():
    def __init__(self):
        self.name = ''
        self.path = ''
        self.errors = {}
        self.runTime = ''
        self.timeProcessed = ''
        self.hashes = {}
        self.attempts = 0
I don't think this code does what I need it to do, which is to process the list of reports in parallel AND, as processWorker manipulates each report object, store those results. As I am fairly new to this, I was hoping someone could help.
The big difference between the two is that the first one builds a dictionary and returns it. The second model shouldn't really be returning anything; I just need the classes to finish being processed, and they should then hold the relevant information within them.
Thanks!
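One common way to handle this (a minimal sketch, not from the original thread: the names mirror the question, but everything else is an assumption of mine) is to make the worker a module-level function that returns the modified Report, and have the parent reassign the results of pool.map:

import multiprocessing

class Report():
    def __init__(self):
        self.name = ''
        self.path = ''

# module-level worker so it can be pickled and sent to child processes
def processWorker(report):
    # changes happen in the child's copy of the object...
    report.name = 'foo'
    report.path = '/path/to/report'
    # ...so the modified copy must be returned to the parent
    return report

class Transfer():
    def __init__(self, reports):
        self.reports = reports

    def processReports(self):
        with multiprocessing.Pool() as pool:
            # collect the returned copies and keep them on the parent object
            self.reports = pool.map(processWorker, self.reports)

if __name__ == '__main__':
    t = Transfer([Report(), Report()])
    t.processReports()
    print([r.name for r in t.reports])  # ['foo', 'foo']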
I have a large dictionary whose structure looks like:
dcPaths = {'id_jola_001': CPath instance}
where CPath is a self-defined class:
class CPath(object):
    def __init__(self):
        # some attributes
        self.m_dAvgSpeed = 0.0
        ...
        # a list of CNode instances
        self.m_lsNodes = []
where m_lsNodes is a list of CNode:
class CNode(object):
    def __init__(self):
        # some attributes
        self.m_nLoc = 0
        # a list of Apps
        self.m_lsApps = []
Here, m_lsApps is a list of CApp, which is another self-defined class:
class CApp(object):
    def __init__(self):
        # some attributes
        self.m_nCount = 0
        self.m_nUpPackets = 0
I serialize this dictionary by using cPickle:
def serialize2File(strFileName, strOutDir, obj):
    if len(obj) != 0:
        strOutFilePath = "%s%s" % (strOutDir, strFileName)
        with open(strOutFilePath, 'w') as hOutFile:
            cPickle.dump(obj, hOutFile, protocol=0)
        return strOutFilePath
    else:
        print("Nothing to serialize!")
It works fine, and the size of the serialized file is about 6.8 GB. However, when I try to deserialize this object:
def deserializeFromFile(strFilePath):
    obj = 0
    with open(strFilePath) as hFile:
        obj = cPickle.load(hFile)
    return obj
I find it consumes more than 90 GB of memory and takes a long time.
Why would this happen?
Is there any way I could optimize this?
BTW, I'm using Python 2.7.6.
You can try specifying the pickle protocol; fastest is -1 (meaning: the latest protocol, no problem if you are pickling and unpickling with the same Python version).
cPickle.dump(obj, file, protocol=-1)
EDIT:
As said in the comments: load detects the protocol itself.
cPickle.load(file)
When you store complex Python objects, Python usually stores a lot of redundant data (look at the object's __dict__ property).
To reduce the memory consumption of the unserialized data, you should pickle only Python natives. You can achieve this easily by implementing the methods object.__getstate__() and object.__setstate__(state) on your classes.
See Pickling and unpickling normal class instances on python documentation.
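As a hedged sketch (my own, reusing the CNode/CApp classes from the question) of how CPath could implement these hooks so that only plain Python data ends up in the pickle:

import cPickle

class CApp(object):
    def __init__(self):
        self.m_nCount = 0
        self.m_nUpPackets = 0

class CNode(object):
    def __init__(self):
        self.m_nLoc = 0
        self.m_lsApps = []

class CPath(object):
    def __init__(self):
        self.m_dAvgSpeed = 0.0
        self.m_lsNodes = []

    def __getstate__(self):
        # store only compact, native Python data instead of nested instance __dict__s
        return {
            'avg_speed': self.m_dAvgSpeed,
            'nodes': [(n.m_nLoc,
                       [(a.m_nCount, a.m_nUpPackets) for a in n.m_lsApps])
                      for n in self.m_lsNodes],
        }

    def __setstate__(self, state):
        # rebuild the full object graph from the compact representation
        self.m_dAvgSpeed = state['avg_speed']
        self.m_lsNodes = []
        for loc, apps in state['nodes']:
            node = CNode()
            node.m_nLoc = loc
            for count, up in apps:
                app = CApp()
                app.m_nCount = count
                app.m_nUpPackets = up
                node.m_lsApps.append(app)
            self.m_lsNodes.append(node)

# round trip: only the compact state is written and read back
blob = cPickle.dumps({'id_jola_001': CPath()}, protocol=-1)
dcPaths = cPickle.loads(blob)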
som = SOM_CLASS() # includes many big difficult data structures
som.hard_work()
som.save_to_disk(filename)
#then later or another program
som = SOM_CLASS()
som.read_from_file(filename)
som.do_anything_else()
or
som = SOM_CLASS()
save(som)
#...
load(som)
som.work()
What is the easiest way to do this?
You can (de)serialize with pickle. It is backward-compatible, i.e. it will support all old protocols in future versions.
import pickle
som = SOM_CLASS()
fileObject = <any file-like object>
pickle.dump(som, fileObject)
#...
som = pickle.load(fileObject)
som.work()
But mind that if you transfer pickled objects to another computer, make sure the connection cannot be tampered with, since unpickling untrusted data is insecure (something every pickle user should be aware of).
Another alternative is the older module marshal (though it only supports core built-in types and is mainly intended for Python's own .pyc files).
I use this code:
import cPickle
import traceback

class someClass():
    def __init__(self):
        # set name from variable name. http://stackoverflow.com/questions/1690400/getting-an-instance-name-inside-class-init
        (filename, line_number, function_name, text) = traceback.extract_stack()[-2]
        def_name = text[:text.find('=')].strip()
        self.name = def_name
        try:
            self.load()
        except:
            ##############
            # to demonstrate
            self.someAttribute = 'bla'
            self.someAttribute2 = ['more']
            ##############
            self.save()

    def save(self):
        """save class as self.name.txt"""
        file = open(self.name+'.txt', 'w')
        file.write(cPickle.dumps(self.__dict__))
        file.close()

    def load(self):
        """try load self.name.txt"""
        file = open(self.name+'.txt', 'r')
        dataPickle = file.read()
        file.close()
        self.__dict__ = cPickle.loads(dataPickle)
This code saves and loads the class instance under the variable name it was assigned to. The code is from my blog http://www.schurpf.com/python-save-a-class/.
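A hypothetical usage sketch (mine, not from the blog post; it must run from a script so that traceback can read the assignment line):

my_obj = someClass()              # first run: load() fails, defaults are set and saved to my_obj.txt
print(my_obj.someAttribute)       # 'bla'

my_obj.someAttribute = 'changed'
my_obj.save()                     # persist the new state

# in a later run of the program:
my_obj = someClass()              # load() now succeeds and restores 'changed'
print(my_obj.someAttribute)       # 'changed'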
Take a look at Python's pickle library.
Use pickle in this way:
import pickle

class Student:
    def __init__(self, name, age, grade):
        self.name = name
        self.age = age
        self.grade = grade  # 0 - 100

    def get_grade(self):
        print(self.grade)

s1 = Student("Tim", 19, 95)

# save it
with open('test.pickle', 'wb') as file:
    pickle.dump(s1, file)

# load it
with open('test.pickle', 'rb') as file2:
    s1_new = pickle.load(file2)

# check it
s1_new.get_grade()
# it prints 95