Save a dictionary to a file Python - python

Say I got a dictionary like this:
Test = {"apple":[3,{1:1,3:5,6:7}],"banana":[4,{1:1,3:5,6:7,11:2}]}
Now I want to save this dictionary into a temporary file so that I can reconstruct the dictionary later. (I am doing external sorting XD)
Can any kind guy help me? I know there is a way to save it in csv format, but this one is a special kind of dictionary. Thx a lot.

In order to save a data structure to a file you need to decide on a serialization format. A serialization format takes an in-memory data structure and turns it into a sequence of bytes that can be written to a file. The process of turning that sequence of bytes back into a data structure is called deserialization.
Python provides a number of options for seralization with different characteristics. Here are two common choices and their tradeoffs:
pickle uses a compact format that can represent almost any Python object. However, it's specific to Python, so only Python programs can (easily) decode the file later. The details of the format can also vary between Python releases, so it's best to use pickle only if you will re-read the data file with the same or a newer version of Python than the one that created it. Pickle is able to deal with recursive data structures. Finally, pickle is inappropriate for reading data provided by other possibly-malicious programs, since decoding a pickled data structure can cause arbitrary code to run in your application in order to reconstruct complex objects.
json uses a human-readable text-based format that can represent only a small set of data types: numbers, strings, None, booleans, lists and dictionaries. However, it is a standard format that can be read by equivalent libraries in many other programming languages, and it does not vary from one Python release to another. JSON does not support recursive data structures, but it is generally safe to decode a JSON object from an unknown source except that of course the resulting data structure might use a lot of memory.
In both of these modules the dump function can write a data structure to a file and the load function can later recover it. The difference is in the format of the data written to the file.

The pickle module is quite convenient for serializing python data. It is also probably the fastest way to dump and reload a python data structure.
>>> import pickle
>>> Test = {"apple":[3,{1:1,3:5,6:7}],"banana":[4,{1:1,3:5,6:7,11:2}]}
>>> pickle.dump(Test, open('test_file', 'w'))
>>> pickle.load(open('test_file', 'r'))
{'apple': [3, {1: 1, 3: 5, 6: 7}], 'banana': [4, {1: 1, 3: 5, 11: 2, 6: 7}]}

For me, clearly, the best serializer is msgpack.
http://jmoiron.net/blog/python-serialization/
Same methods as the others.

Related

How to save multiple data at once in Python

I am running a script which takes, say, an hour to generate the data I want. I want to be able to save all of the relevant variables to some external file so I can fiddle with them later without having to run the hour-long calculation over again. Is there an easy way I can save all of the variables I need into one convenient file?
In Matlab I would just contain all of the results of the calculation in a single structure so that later I could just load results.mat and I would have everything I need stored as results.output1, results.output2 or whatever. What is the Python equivalent of this?
In particular, the data that I would like to save includes arrays of complex numbers, which seems to present difficulties for using things like json.
I suggest taking look at built-in shelve module which provides persistent, dictionary-like object and generally does work with all native Python types so you can do:
Write complex to some file (in my example it is named mydata) under key n (keep in mind that keys should be strings).
import shelve
my_number = 2+7j
with shelve.open('mydata') as db:
db['n'] = my_number
Later retrieve that number from given file
import shelve
with shelve.open('mydata') as db:
my_number = db['n']
print(my_number) # (2+7j)
You can use pickle function in Python and then use the dump function to dump all your data into a file. You can reuse the data later.I suggest you find more about pickle.
I would recommend a json file. With json you can assign variables to keywords, just like dictionaries in stock python. The json package is automatically installed when installing python.
import json
dict = {var1: "abcde", var2: "fghij"}
with open(path, "w") as file:
json.dump(dict, file, indent=2, ensure_ascii = False)
You can also load this from a file using the same api:
with open(path, r) as file:
text = file.read()
dict = json.loads(text)
Edit: Json can also handle every datatype python can, so if you want to save an array you can just define that in the dict:
dict = {list1: ["ab", "cd", "ef"]}

Saving and loading simple data in Python convenient way

I'm currently working on a simple Python 3.4.3 and Tkinter game.
I struggle with saving/reading data now, because I'm a beginner at coding.
What I do now is use .txt files to store my data, but I find this extremely counter-intuitive, as saving/reading more than one line of data requires of me to have additional code to catch any newlines.
Skipping a line would be terrible too.
I've googled it, but I either find .txt save/file options or way too complex ones for saving large-scale data.
I only need to save some strings right now and be able to access them (if possible) by key like in a dictionary key:value .
Do you know of any file format/method to help me accomplish that?
Also: If possible, should work on Win/iOS/Linux.
It sounds like using json would be best for this, which comes as part of the Python Standard library in Python-2.6+
import json
data = {'username':'John', 'health':98, 'weapon':'warhammer'}
# serialize the data to user-data.txt
with open('user-data.txt', 'w') as fobj:
json.dump(data, fobj)
# read the data back in
with open('user-data.txt', 'r') as fobj:
data = json.load(fobj)
print(data)
# outputs:
# {u'username': u'John', u'weapon': u'warhammer', u'health': 98}
A popular alternative is yaml, which is actually a superset of json and produces slightly more human readable results.
You might want to try Redis.
http://redis.io/
I'm not totally sure it'll meet all your needs, but it would probably be better than a flat file.

Save python data structure to import later

My program creates a probabilistic model that I want to save as a module to import later. How can I save it in a way that it can be directly imported?
Json is good for dicts, but I have different data structures, Pickle does not seem to allow to use import directly and pprint does not print the name and assignment of the structures.
I would just like to create some data structures:
states = (
'Bound',
'Not-bound'
)
Prob = {
'Bound': 0.45,
'Not-bound': 0.55
}
save them somehow to a 'py' file:
with open('model.py', 'wb') as out:
save(states)
save(Prob)
Then, import them later directly:
import model
print(model.states)
Take a look at the pickle module.
The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
It won't be quite the way you want it to be but I think it's a simple and reasonable way of doing what you want.

Python pickle to xml

How can I convert to a pickle object to a xml document?
For example, I have a pickle like that:
cpyplusplus_test
Coordinate
p0
(I23
I-11
tp1
Rp2
.
I want to get something like:
<Coordinate>
<x>23</x>
<y>-11</y>
</Coordinate>
The Coordinate class has x and y attributes of course. I can supply a xml schema for conversion.
I tried gnosis.xml module. It can objectify xml documents to python object. But it cannot serialize objects to xml documents like above.
Any suggestion?
Thanks.
gnosis.xml does support pickling to XML:
import gnosis.xml.pickle
xml_str = gnosis.xml.pickle.dumps(obj)
To deserialize the XML, use loads:
o2 = gnosis.xml.pickle.loads(xml_str)
Of course, this will not directly convert existing pickles to XML — you have to first deserialize them into live object, and then dump them to XML.
Having said that, I must warn you that gnosis.xml is quite slow, somewhat fragile, and most likely unmaintained (last release was over six years ago). It is also very bloated, containing a huge number of subpackages with lots and lots of features that not only you won't need, but are untested and buggy. We tried to use for our development and, after a lot of effort wasted on trying to debug and improve it, ended up writing a simple XML pickler running at ~500 lines of code, and never looked back.
First you need to unpickle the data by pickle.load or pickle.loads. Then generate xml snippet. If you have a pickle in tmpStr variable, simply do this:
c = pickle.loads(tmpStr)
print '<Coordinate>\n<x>%d</x>\n<y>%d</y>\n</Coordinate>' % (c.x, c.y)
Writing to file is left as an exercise to the reader.

Pickle dump huge file without memory error

I have a program where I basically adjust the probability of certain things happening based on what is already known. My file of data is already saved as a pickle Dictionary object at Dictionary.txt.
The problem is that everytime that I run the program it pulls in the Dictionary.txt, turns it into a dictionary object, makes it's edits and overwrites Dictionary.txt. This is pretty memory intensive as the Dictionary.txt is 123 MB. When I dump I am getting the MemoryError, everything seems fine when I pull it in..
Is there a better (more efficient) way of doing the edits? (Perhaps w/o having to overwrite the entire file everytime)
Is there a way that I can invoke garbage collection (through gc module)? (I already have it auto-enabled via gc.enable())
I know that besides readlines() you can read line-by-line. Is there a way to edit the dictionary incrementally line-by-line when I already have a fully completed Dictionary object File in the program.
Any other solutions?
Thank you for your time.
I was having the same issue. I use joblib and work was done. In case if someone wants to know other possibilities.
save the model to disk
from sklearn.externals import joblib
filename = 'finalized_model.sav'
joblib.dump(model, filename)
some time later... load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)
I am the author of a package called klepto (and also the author of dill).
klepto is built to store and retrieve objects in a very simple way, and provides a simple dictionary interface to databases, memory cache, and storage on disk. Below, I show storing large objects in a "directory archive", which is a filesystem directory with one file per entry. I choose to serialize the objects (it's slower, but uses dill, so you can store almost any object), and I choose a cache. Using a memory cache enables me to have fast access to the directory archive, without having to have the entire archive in memory. Interacting with a database or file can be slow, but interacting with memory is fast… and you can populate the memory cache as you like from the archive.
>>> import klepto
>>> d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> d
dir_archive('stuff', {}, cached=True)
>>> import numpy
>>> # add three entries to the memory cache
>>> d['big1'] = numpy.arange(1000)
>>> d['big2'] = numpy.arange(1000)
>>> d['big3'] = numpy.arange(1000)
>>> # dump from memory cache to the on-disk archive
>>> d.dump()
>>> # clear the memory cache
>>> d.clear()
>>> d
dir_archive('stuff', {}, cached=True)
>>> # only load one entry to the cache from the archive
>>> d.load('big1')
>>> d['big1'][-3:]
array([997, 998, 999])
>>>
klepto provides fast and flexible access to large amounts of storage, and if the archive allows parallel access (e.g. some databases) then you can read results in parallel. It's also easy to share results in different parallel processes or on different machines. Here I create a second archive instance, pointed at the same directory archive. It's easy to pass keys between the two objects, and works no differently from a different process.
>>> f = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> f
dir_archive('stuff', {}, cached=True)
>>> # add some small objects to the first cache
>>> d['small1'] = lambda x:x**2
>>> d['small2'] = (1,2,3)
>>> # dump the objects to the archive
>>> d.dump()
>>> # load one of the small objects to the second cache
>>> f.load('small2')
>>> f
dir_archive('stuff', {'small2': (1, 2, 3)}, cached=True)
You can also pick from various levels of file compression, and whether
you want the files to be memory-mapped. There are a lot of different
options, both for file backends and database backends. The interface
is identical, however.
With regard to your other questions about garbage collection and editing of portions of the dictionary, both are possible with klepto, as you can individually load and remove objects from the memory cache, dump, load, and synchronize with the archive backend, or any of the other dictionary methods.
See a longer tutorial here: https://github.com/mmckerns/tlkklp
Get klepto here: https://github.com/uqfoundation
None of the above answers worked for me. I ended up using Hickle which is a drop-in replacement for pickle based on HDF5. Instead of saving it to a pickle it's saving the data to HDF5 file. The API is identical for most use cases and it has some really cool features such as compression.
pip install hickle
Example:
# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')
# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')
# Load data
array_hkl = hkl.load('test.hkl')
I had memory error and resolved it by using protocol=2:
cPickle.dump(obj, file, protocol=2)
If your key and values are string, you can use one of the embedded persistent key-value storage engines available in Python standard library. Example from the anydbm module docs:
import anydbm
# Open database, creating it if necessary.
db = anydbm.open('cache', 'c')
# Record some values
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'
# Loop through contents. Other dictionary methods
# such as .keys(), .values() also work.
for k, v in db.iteritems():
print k, '\t', v
# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4
# Close when done.
db.close()
Have you tried using streaming pickle: https://code.google.com/p/streaming-pickle/
I have just solved a similar memory error by switching to streaming pickle.
How about this?
import cPickle as pickle
p = pickle.Pickler(open("temp.p","wb"))
p.fast = True
p.dump(d) # d could be your dictionary or any file
I recently had this problem. After trying cpickle with ASCII and the binary protocol 2, I found that my SVM from sci-kit learn trained on 20+ gb of data was not pickling due to a memory error. However, the dill package seemed to resolve the issue. Dill will not create many improvements for a dictionary but may help with streaming. It is meant to stream pickled bytes across a network.
import dill
with open(path,'wb') as fp:
dill.dump(outpath,fp)
dill.load(fp)
If efficiency is an issue, try loading/saving to a database. In this instance, your storage solution may be an issue. At 123 mb Pandas should be fine. However, if the machine has limited memory SQL offers fast,optimized, bag operations over data, usually with multithreaded support.
My poly kernel svm saved.
This may seem trivial, but try to use the 64bit Python if you are not.
I have tried the following solution, but all of them can't resolve my problem.
Using hickle to replace pickle
Using joblib to replace pickle
Using sklearn.externals joblib to replace pickle
Change the pickle mode
Provide a different method for this issue:
Finally, I found the root cause is that the work directory folder was too long.
So that I change the directory to a very short structure.
Enjoy it.

Categories

Resources