Python: Incrementally marshal / pickle an object?

I have a large object I'd like to serialize to disk. I'm finding marshal works quite well and is nice and fast.
Right now I'm creating my large object and then calling marshal.dump. I'd like to avoid holding the large object in memory if possible - I'd like to dump it incrementally as I build it. Is that possible?
The object is fairly simple, a dictionary of arrays.

The bsddb module's 'hashopen' and 'btopen' functions provide a persistent dictionary-like interface. Perhaps you could use one of these, instead of a regular dictionary, to incrementally serialize the arrays to disk?
import bsddb
import marshal
db = bsddb.hashopen('file.db')
db['array1'] = marshal.dumps(array1)
db['array2'] = marshal.dumps(array2)
...
db.close()
To retrieve the arrays:
db = bsddb.hashopen('file.db')
array1 = marshal.loads(db['array1'])
...
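Note that bsddb is a Python 2 module; on Python 3, a rough sketch of the same idea using the standard dbm module instead might look like this (the file name and keys are just placeholders, and the arrays are stand-ins):
import dbm
import marshal

array1 = [1, 2, 3]   # stand-ins for the arrays you are building
array2 = [4, 5, 6]

with dbm.open('file.db', 'c') as db:      # keys and values are stored as bytes
    db['array1'] = marshal.dumps(array1)
    db['array2'] = marshal.dumps(array2)

with dbm.open('file.db') as db:
    array1 = marshal.loads(db['array1'])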

If all your object needs to be is a dictionary of lists, then you may be able to use the shelve module. It presents a dictionary-like interface where the keys and values are stored in a database file instead of in memory. One limitation which may or may not affect you is that keys in Shelf objects must be strings. Value storage will be more efficient if you specify protocol=-1 when creating the Shelf object, so that it uses the more compact binary pickle representation.
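For example, a minimal sketch of that approach (the file name, key names, and build_array function are all made up for illustration):
import shelve

def build_array(key):
    # stand-in for whatever expensive step builds one array at a time
    return list(range(10))

with shelve.open('big_object.db', protocol=-1) as shelf:
    for key in ('array1', 'array2'):      # hypothetical keys
        shelf[key] = build_array(key)     # only one array needs to be in memory here

with shelve.open('big_object.db') as shelf:
    print(shelf['array1'])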

This very much depends on how you are building the object. Is it an array of sub-objects? You could marshal/pickle each array element as you build it. Is it a dictionary? The same idea applies (marshal/pickle each key-value pair).
If it is just a big, complex, hairy object, you might want to marshal-dump each piece of the object and then re-apply whatever your 'building' process is when you read it back in.

You should be able to dump the item piece by piece to the file. The two design questions that need settling are:
How are you building the object when you're putting it in memory?
How do you need your data when it comes out of memory?
If your build process populates the entire array associated with a given key at a time, you might just dump the key:array pair in a file as a separate dictionary:
big_hairy_dictionary['sample_key'] = pre_existing_array
with open('central_file', 'ab') as f:  # marshal.dump needs an open binary file object, not a file name
    marshal.dump({'sample_key': big_hairy_dictionary['sample_key']}, f)
Then, on reload, each call to marshal.load on the open 'central_file' will return one such dictionary that you can use to update a central dictionary. But this is really only going to be helpful if, when you need the data back, you want to handle reading 'central_file' once per key.
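For illustration, a minimal sketch of that re-reading step (the file name comes from the example above; marshal.load raises EOFError at the end of the file):
import marshal

central = {}
with open('central_file', 'rb') as f:
    while True:
        try:
            central.update(marshal.load(f))   # each record is a one-key dictionary
        except EOFError:
            break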
Alternately, if you are populating arrays element by element in no particular order, maybe try:
big_hairy_dictionary['sample_key'].append(single_element)
with open('marshaled_files/' + 'sample_key', 'ab') as f:  # again, marshal.dump wants a binary file object
    marshal.dump(single_element, f)
Then, when you load it back, you don't necessarily need to build the entire dictionary to get back what you need; you just keep calling marshal.load on the open 'marshaled_files/sample_key' file until it raises EOFError, and you have everything associated with the key.
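A matching sketch for reading one per-key file back, element by element:
import marshal

elements = []
with open('marshaled_files/sample_key', 'rb') as f:
    while True:
        try:
            elements.append(marshal.load(f))
        except EOFError:
            break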

Related

Hook into pickle to obtain pre-binary representation?

If I understand correctly, pickle converts the state of an object into something like a dict, including the class of the object, and then writes that data to a binary file. Obtaining the state of an object is done via a complex interface, in the simplest case accessing the object's __dict__, but possibly involving user-defined methods like __getstate__, __setstate__, etc. When a pickle file is loaded, the binary data is read back into a dict-like representation, which is then converted back into objects.
My question: Is it possible to hook into pickle at the point after obtaining the object state but before writing the binary data, and the same in the other direction (after reading binary data but before restoring objects)?
Background: I'm thinking of implementing something similar to jsonpickle and hickle, i.e. having the same interface of dump and load, but using another file format to store data (here: JSON & HDF5). If possible, I would like to avoid reproducing the lengths pickle goes to in accessing and restoring object states but reuse that part, and only create a new "backend".
A solution using dill would be just as good.
If you call dumps on an object and look at the module pickle.py: https://github.com/python/cpython/blob/3.9/Lib/pickle.py#L107, you'll see that pickle converts an object to a series of opcodes (and recursively stored data). This is basically what is written to disk when you use dump. I authored the part of hickle that stores arbitrary objects -- by first using dill.dumps to generate a string of opcodes and data, then using HDF to store the string. If you turn on tracing in dill, you can see how the opcodes and data are stored in the string.
>>> x = dict(a=[1,2,3], b=set((4,5,6)))
>>> import dill
>>> dill.detect.trace(True)
>>> dill.dumps(x)
D2: <dict object at 0x11023c870>
T1: <class 'set'>
F2: <function _load_type at 0x11070f2f0>
# F2
# T1
# D2
b'\x80\x03}q\x00(X\x01\x00\x00\x00aq\x01]q\x02(K\x01K\x02K\x03eX\x01\x00\x00\x00bq\x03cdill._dill\n_load_type\nq\x04X\x03\x00\x00\x00setq\x05\x85q\x06Rq\x07]q\x08(K\x04K\x05K\x06e\x85q\tRq\nu.'
It creates a dict, which stores a list of ints (no special function needed), then stores a special function (_load_type) to help reconstitute the set, and finally stores the set of ints. Opcodes at the beginning signify the pickle version and protocol.
So, yes, you can access the state (in serialized form) before it is dumped to file.
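As a minimal sketch of that idea, with pickle.dumps standing in for dill.dumps and a plain file standing in for the HDF5 backend (both substitutions are mine), the serialized state is just a byte string you can hand to any storage layer:
import pickle

x = dict(a=[1, 2, 3], b=set((4, 5, 6)))

payload = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)   # the serialized state, as bytes

with open('x.blob', 'wb') as f:    # any backend that can store bytes would do here
    f.write(payload)

with open('x.blob', 'rb') as f:
    restored = pickle.loads(f.read())

assert restored == x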

Stepwise creation of a YAML file

I am facing the following problem: I create a big data set (several tens of GB) of Python objects. I want to create an output file in YAML format containing an entry for each object, with information about the object saved as a nested dictionary. However, I never hold all the data in memory at the same time.
The output data should be stored in a dictionary mapping an object name to the saved values. A simple version would look like this:
object_1:
  value_1: 42
  value_2: 23
object_2:
  value_1: 17
  value_2: 13
[...]
object_a_lot:
  value_1: 47
  value_2: 11
To keep a low memory footprint, I would like to write the entry for each object and immediately delete it after writing. My current approach is as follows:
from yaml import dump
[...] # initialize huge_object_list. Here it is still small
with open("output.yaml", "w") as yaml_file:
for my_object in huge_object_list:
my_object.compute() # this blows up the size of the object
# create a single entry for the top level dict
object_entry = dump(
{my_object.name: my_object.get_yaml_data()},
default_flow_style=False,
)
yaml_file.write(object_entry)
my_object.delete_big_stuff() # delete the memory consuming stuff in the object, keep other information which is needed later
Basically I am writing several dictionaries, but each has only one key, and since the object names are unique this does not blow up. This works, but it feels like a bit of a hack, and I would like to ask if someone knows a better/more proper way to do this.
Is there a way to write a big dictionary to a YAML file, one entry at a time?
If you want to write out a YAML file in stages, you can do it the way you describe.
If your keys are not guaranteed to be unique, then I would recommend using a sequence (i.e. a list at the top level, even with one item) instead of a mapping.
This doesn't solve the problem of re-reading the file, as PyYAML will try to read the file as a whole, and that is not going to load quickly; keep in mind that the memory overhead PyYAML needs for loading a file can easily be over 100x (a hundred times) the file size. My ruamel.yaml is somewhat better with respect to memory, but it still requires several tens of times the file size in memory.
You can of course cut up the file yourself based on the "leading" spaces; a new key (or a dash for an item, in case you use sequences) is easy to find. You can also look at storing each key-value pair in its own document within one file; that vastly reduces the overhead during loading, if you combine the key-value pairs of the single documents yourself.
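A rough sketch of that one-document-per-entry idea with PyYAML (explicit_start writes the --- separator between documents; the data here is made up):
import yaml

objects = {'object_1': {'value_1': 42, 'value_2': 23},
           'object_2': {'value_1': 17, 'value_2': 13}}

# write each key-value pair as its own YAML document
with open('output.yaml', 'w') as yaml_file:
    for name, data in objects.items():
        yaml_file.write(yaml.dump({name: data}, explicit_start=True,
                                  default_flow_style=False))

# combine the single documents yourself when loading
combined = {}
with open('output.yaml') as yaml_file:
    for doc in yaml.safe_load_all(yaml_file):
        combined.update(doc)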
In similar situations I have stored individual YAML "objects" in different files, using the filenames as keys to the "object" values. This requires an efficient filesystem (e.g. one with tail packing) and depends on what is available for the OS your system runs on.

How to create a memoryview for a non-contiguous memory location?

I have a fragmented structure in memory and I'd like to access it as a contiguous-looking memoryview. Is there an easy way to do this or should I implement my own solution?
For example, consider a file format that consists of records. Each record has a fixed-length header that specifies the length of the record's content. A higher level logical structure may spread over several records. It would make implementing the higher level structure easier if it could see its own fragmented memory as a simple contiguous array of bytes.
Update:
It seems that Python supports this 'segmented' buffer type internally, at least based on this part of the documentation. But it is only exposed through the C API.
Update2:
As far as I can see, the referenced C API - called old-style buffers - does what I need, but it is now deprecated and unavailable in newer versions of Python (3.x). The new buffer protocol - specified in PEP 3118 - offers a new way to represent buffers. This API is more usable in most use cases (among them, cases where the represented buffer is not contiguous in memory), but it does not support this specific one, where a one-dimensional array may be laid out completely freely (in multiple, differently sized chunks) in memory.
First - I am assuming you are just trying to do this in pure Python rather than in a C extension. So I am assuming you have loaded the different records you are interested in into a set of Python objects, and your problem is that you want to see the higher level structure that is spread across these objects, with bits here and there throughout them.
So can you not simply load each of the records into a bytearray? You can then use Python slicing to create a new array that has just the data for the high level structure you are interested in. You will then have a single bytearray with just the data you are interested in and can print it out or manipulate it in any way you want.
So something like:
a = bytearray(b"Hello World") # put your records into byte arrays like this
b = bytearray(b"Stack Overflow")
complexStructure = bytearray(a[0:6]+b[0:]) # Slice and join arrays to form
# new array with just data from your
# high level entity
print complexStructure
Of course you will still need to know where within the records your high level structure is in order to slice the arrays correctly, but you would need to know this anyway.
EDIT:
Note that taking a slice of a list does not copy the data in the list; it just creates a new set of references to the data, so:
>>> a = [1,2,3]
>>> b = a[1:3]
>>> id(a[1])
140268972083088
>>> id(b[0])
140268972083088
However, changes to the list b will not change a, as b is a new list. To have changes automatically propagate to the original list, you would need to make a more complicated object that holds the lists of the original records and hides them in such a way that it can decide which list, and which element of that list, to change or view when a user wants to modify/view the complex structure. So something like:
class ComplexStructure():
    def __init__(self):
        self.listofrecords = []
    def add_records(self, record):
        self.listofrecords.append(record)
    def get_value(self, position):
        listnum, posinlist = ...  # formula to figure out which list and where in
                                  # that list the element of the complex structure is
        return self.listofrecords[listnum][posinlist]
    def set_value(self, position, value):
        listnum, posinlist = ...  # formula to figure out which list and where in
                                  # that list the element of the complex structure is
        self.listofrecords[listnum][posinlist] = value
Granted, this is not the simple way of doing things you were hoping for, but it should do what you need.

Dump a Python list to file as code text

Say I have a Python list:
arr = [
    [1,2,3,4],
    [11,22,33,44]
]
I want to dump this object to a file as code, so I can eval() it back later; the content of the file should be:
[
    [1,2,3,4],
    [11,22,33,44]
]
I don't want to use pickle since it is way too slow.
print repr(arr).
Of course, pickle isn't especially slow. And, as zhangyangyu notes, while this works for a list, it won't work for objects whose repr cannot be eval'd.
I think you can use repr, and then write the result to a file.
repr(...)
    repr(object) -> string

    Return the canonical string representation of the object.
    For most object types, eval(repr(object)) == object.
But this is not safe: if the file is changed, something terrible may happen when you eval() it.
What's more, it seems the list in your file is formatted over several lines. When that is the case, how do you convert it back? When you read the contents back, you have to add logic to detect where the string representing the list ends. If it is all on one line, it is easier.
So, using some existing module is not a bad idea and is the usual way.
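That said, if you do go the repr route, here is a small sketch of the round trip; ast.literal_eval is my substitution for eval, and it only accepts Python literals, which sidesteps the safety concern above (the file name is arbitrary):
import ast

arr = [
    [1, 2, 3, 4],
    [11, 22, 33, 44]
]

with open('arr.txt', 'w') as f:
    f.write(repr(arr))                     # writes the list as Python source text

with open('arr.txt') as f:
    restored = ast.literal_eval(f.read())  # parses literals only, unlike eval

assert restored == arr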
Use cPickle. It's orders of magnitude faster than pickle.
Use cPickle, and you can do import cPickle as pickle so that existing code doesn't have to change. I'm using cPickle frequently myself and 20+MB files load decently fast (layered dictionaries/lists with thousands of entries).

Elegant way to store dictionary permanently with Python?

I am currently expensively parsing a file, which generates a dictionary of ~400 key-value pairs that is seldom updated. Previously I had a function which parsed the file, wrote it to a text file in dictionary syntax (i.e. dict = {'Adam': 'Room 430', 'Bob': 'Room 404'}, etc.), and copied and pasted that into another function whose sole purpose was to return the parsed dictionary.
Hence, in every file where I would use that dictionary, I would import that function, and assign it to a variable, which is now that dictionary. Wondering if there's a more elegant way to do this, which does not involve explicitly copying and pasting code around? Using a database kind of seems unnecessary, and the text file gave me the benefit of seeing whether the parsing was done correctly before adding it to the function. But I'm open to suggestions.
Why not dump it to a JSON file, and then load it from there where you need it?
import json
with open('my_dict.json', 'w') as f:
    json.dump(my_dict, f)

# elsewhere...
with open('my_dict.json') as f:
    my_dict = json.load(f)
Loading from JSON is fairly efficient.
Another option would be to use pickle, but unlike JSON, the files it generates aren't human-readable so you lose out on the visual verification you liked from your old method.
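If you do go the pickle route instead, a minimal sketch looks much the same as the JSON version above (the file name is arbitrary; note the binary mode):
import pickle

my_dict = {'Adam': 'Room 430', 'Bob': 'Room 404'}

with open('my_dict.pickle', 'wb') as f:    # pickle files must be opened in binary mode
    pickle.dump(my_dict, f)

# elsewhere...
with open('my_dict.pickle', 'rb') as f:
    my_dict = pickle.load(f)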
Why mess with all these serialization methods? It's already written to a file as a Python dict (although with the unfortunate name 'dict'). Change your program to write out the data with a better variable name - maybe 'data', or 'catalog', and save the file as a Python file, say data.py. Then you can just import the data directly at runtime without any clumsy copy/pasting or JSON/shelve/etc. parsing:
from data import catalog
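As a sketch of the writing side of that approach, assuming the variable name catalog and the file name data.py from above (using pprint.pformat for readable output is my own choice):
from pprint import pformat

catalog = {'Adam': 'Room 430', 'Bob': 'Room 404'}   # the parsed result

with open('data.py', 'w') as f:
    f.write('catalog = ' + pformat(catalog) + '\n')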
JSON is probably the right way to go in many cases; but there might be an alternative. It looks like your keys and your values are always strings, is that right? You might consider using dbm/anydbm. These are "databases" but they act almost exactly like dictionaries. They're great for cheap data persistence.
>>> import anydbm
>>> dict_of_strings = anydbm.open('data', 'c')
>>> dict_of_strings['foo'] = 'bar'
>>> dict_of_strings.close()
>>> dict_of_strings = anydbm.open('data')
>>> dict_of_strings['foo']
'bar'
If the keys are all strings, you can use the shelve module
A shelf is a persistent, dictionary-like object. The difference with
“dbm” databases is that the values (not the keys!) in a shelf can be
essentially arbitrary Python objects — anything that the pickle module
can handle. This includes most class instances, recursive data types,
and objects containing lots of shared sub-objects. The keys are
ordinary strings.
json would be a good choice if you need to use the data from other languages
If storage efficiency matters, use pickle or cPickle (for an execution performance gain). As Amber pointed out, you can also dump/load via JSON. It will be human-readable, but takes more disk space.
I suggest you consider using the shelve module since your data-structure is a mapping.
That was my answer to a similar question titled If I want to build a custom database, how could I? There's also a bit of sample code in another answer of mine promoting its use for the question How to get a object database?
ActiveState has a highly rated PersistentDict recipe which supports csv, json, and pickle output file formats. It's pretty fast, since all three of those formats are implemented in C (although the recipe itself is pure Python), so the fact that it reads the whole file into memory when it's opened might be acceptable.
JSON (or YAML, or whatever) serialisation is probably better, but if you're already writing the dictionary to a text file in Python syntax, complete with a variable name binding, you could just write that to a .py file instead. Then that Python file would be importable and usable as is. There's no need for the "function which returns a dictionary" approach, since you can directly use it as a global in that file. e.g.
# generated.py
please_dont_use_dict_as_a_variable_name = {'Adam': 'Room 430', 'Bob': 'Room 404'}
rather than:
# manually_copied.py
def get_dict():
    return {'Adam': 'Room 430', 'Bob': 'Room 404'}
The only difference is that manually_copied.get_dict gives you a fresh copy of the dictionary every time, whereas generated.please_dont_use_dict_as_a_variable_name[1] is a single shared object. This may matter if you're modifying the dictionary in your program after retrieving it, but you can always use copy.copy or copy.deepcopy to create a new copy if you need to modify one independently of the others.
[1] dict, list, str, int, map, etc. are generally viewed as bad variable names. The reason is that these are already defined as built-ins and are used very commonly. So if you give something a name like that, at the least it's going to cause cognitive dissonance for people reading your code (including you, after you've been away for a while), as they have to keep in mind that "dict doesn't mean what it normally does here". It's also quite likely that at some point you'll get an infuriating-to-solve bug reporting that dict objects aren't callable (or something), because some piece of code is trying to use the type dict but is getting the dictionary object you bound to the name dict instead.
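As an aside on the copy.deepcopy suggestion above, a tiny sketch with made-up data:
import copy

rooms = {'Adam': 'Room 430', 'Bob': 'Room 404'}
my_rooms = copy.deepcopy(rooms)        # an independent copy
my_rooms['Adam'] = 'Room 431'
assert rooms['Adam'] == 'Room 430'     # the shared original is untouched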
On the JSON front, there is also something called simpleJSON. The first time I used JSON in Python, the json library didn't work for me / I couldn't figure it out; simpleJSON was... easier to use.
