I'm using CherryPy and it seems to not behave nicely when it comes to retrieving data from stored files on the server. (I asked for help on that and nobody replied, so I'm on to plan B, or C...) Now I have stored a class containing a bunch of data structures (3 dictionaries and two lists of lists all related) in a MySQL table, and amazingly, it was easier than I thought to insert the binary object (longblob). I turned it into a pickle file and INSERTED it.
However, I can't figure out how to reconstitute the pickle and rebuild the class full of data from it now. The database returns a giant string that looks like the pickle, but how do you make a string into a file-like object so that pickle.load(data) will work?
Alternative solutions: How to save the class as a BLOB in database, or some ideas on why I can save a pickle of this class but when I go to load it later, the class seems to be lost. But in SSH / locally, it works - only when calling pickle.load(xxx) from cherrypy do I get errors.
I'm up for plan D - if there's a better way to store a collection of structured data for fast retrieval without pickles or MYSQL blobs...
You can create a file-like in-memory object with (c)StringIO:
>>> from cStringIO import StringIO
>>> fobj = StringIO('file\ncontent')
>>> for line in fobj:
...     print line
...
file
content
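On Python 3, cStringIO no longer exists; the same idea is available as io.StringIO for text and io.BytesIO for bytes. A minimal sketch, reusing the sample text from the session above:

```python
import io

# Wrap an in-memory string so it behaves like a text file opened for reading.
fobj = io.StringIO("file\ncontent")

# Iterating yields one line at a time, just as with a real file.
lines = [line.rstrip("\n") for line in fobj]
print(lines)
```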
But for pickle usage you can directly load and dump to a string (have a look at the s in the function names):
>>> import pickle
>>> obj = 1
>>> serialized = pickle.dumps(obj)
>>> serialized
'I1\n.'
>>> pickle.loads(serialized)
1
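On Python 3 the value read back from a BLOB column should be bytes, and pickle.loads accepts it directly; io.BytesIO is only needed when some API insists on a file-like object. A small sketch of both routes, where the `payload` dictionary is a made-up stand-in for the real data:

```python
import io
import pickle

# Hypothetical stand-in for the stored data: plain dicts and lists of lists.
payload = {"prices": {"apple": 0.5}, "rows": [[1, 2], [3, 4]]}

# dumps() returns bytes, which is what a BLOB column expects.
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# Option 1: load straight from the bytes.
restored = pickle.loads(blob)

# Option 2: wrap the bytes in a file-like object and use load().
restored_via_file = pickle.load(io.BytesIO(blob))
```

If the pickle holds instances of a custom class, that class has to be importable under the same module path in the process that unpickles it. That may be why loading works from a shell but fails inside CherryPy, though the question does not give enough detail to confirm it.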
But for structured data stored in a database, I would suggest in general that you either use
a table, preferably with an ORM like SQLAlchemy so it is directly mapped to a class, or
a dictionary, which can be easily (de)serialized with JSON,
and not use pickle at all.
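The JSON route could look roughly like this. The `data` dictionary is invented for illustration, and the sketch assumes the structures contain only dicts, lists, strings, numbers, booleans and None:

```python
import json

# Hypothetical structured data: dictionaries and lists of lists only.
data = {"prices": {"apple": 0.5, "pear": 0.75}, "rows": [[1, 2], [3, 4]]}

# Serialize to a str that fits in a TEXT column.
text = json.dumps(data)

# Rebuild the structure from the stored text.
restored = json.loads(text)
```

Unlike a pickle, the stored text stays readable from other languages and does not depend on any class being importable.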
I struggled with this myself.
Convert the string returned by the database to bytes using the UTF-8 charset, then load the data into your object:
CurrentShoppingCart.SetCartItems(pickle.loads(bytes(DBCart[0]['Cart'], 'UTF-8')))
Andrew
I am running a script which takes, say, an hour to generate the data I want. I want to be able to save all of the relevant variables to some external file so I can fiddle with them later without having to run the hour-long calculation over again. Is there an easy way I can save all of the variables I need into one convenient file?
In Matlab I would just contain all of the results of the calculation in a single structure so that later I could just load results.mat and I would have everything I need stored as results.output1, results.output2 or whatever. What is the Python equivalent of this?
In particular, the data that I would like to save includes arrays of complex numbers, which seems to present difficulties for using things like json.
I suggest taking a look at the built-in shelve module, which provides a persistent, dictionary-like object and generally works with all native Python types, so you can do the following.
Write a complex number to a file (named mydata in my example) under the key n (keep in mind that keys should be strings):
import shelve
my_number = 2+7j
with shelve.open('mydata') as db:
    db['n'] = my_number
Later, retrieve that number from the same file:
import shelve
with shelve.open('mydata') as db:
    my_number = db['n']
print(my_number) # (2+7j)
You can use the pickle module in Python: its dump function writes all your data into a file, and its load function reads the data back later. I suggest you read up on pickle.
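As a rough illustration of the pickle approach, the sketch below saves a dictionary of results, including complex numbers, and loads it back. The `results` contents and the temporary file location are assumptions made up for the example:

```python
import os
import pickle
import tempfile

# Hypothetical results of the long computation, including complex numbers.
results = {"output1": [1 + 2j, 3 - 4j], "output2": {"tolerance": 1e-6}}

# Illustrative file location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "results.pkl")

# Write everything in one go.
with open(path, "wb") as fh:
    pickle.dump(results, fh)

# Later (possibly in another session), read it back.
with open(path, "rb") as fh:
    loaded = pickle.load(fh)
```

Keeping everything in one dictionary gives access such as loaded["output1"], which is close to the results.output1 style described in the question.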
I would recommend a JSON file. With json you can assign values to keys, just like dictionaries in stock Python. The json package is part of the standard library, so it is installed with Python.
import json
path = "results.json"  # example file name
data = {"var1": "abcde", "var2": "fghij"}
with open(path, "w") as file:
    json.dump(data, file, indent=2, ensure_ascii=False)
You can also load this from a file using the same API:
with open(path, "r") as file:
    data = json.load(file)
Edit: JSON handles dictionaries, lists, strings, numbers, booleans and None, so if you want to save a list you can just define that in the dict (complex numbers are not supported directly and would have to be converted first, for example to pairs of floats):
data = {"list1": ["ab", "cd", "ef"]}
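The question points out that complex numbers are awkward for json. One possible workaround is to convert them on the way out and back in with the `default` and `object_hook` parameters. The "__complex__" tag and both helper functions are conventions invented for this sketch, not part of the json module:

```python
import json

def encode_complex(value):
    """Fallback for json.dumps: turn a complex number into a tagged dict."""
    if isinstance(value, complex):
        return {"__complex__": [value.real, value.imag]}
    raise TypeError(f"Cannot serialize {type(value).__name__}")

def decode_complex(obj):
    """object_hook for json.loads: rebuild complex numbers from tagged dicts."""
    if "__complex__" in obj:
        real, imag = obj["__complex__"]
        return complex(real, imag)
    return obj

# Hypothetical data containing complex values.
data = {"output1": [1 + 2j, 3 - 4j], "label": "run 1"}

text = json.dumps(data, default=encode_complex)
restored = json.loads(text, object_hook=decode_complex)
```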
I have a list in my program and a function that appends to it. Unfortunately, when you close the program the items you added go away and the list goes back to its initial state. Is there any way that I can store the data so that when the user re-opens the program the list is still complete?
You can try the pickle module to store in-memory data on disk. Here is an example:
store data:
import pickle
dataset = ['hello', 'test']
outputFile = 'test.data'
with open(outputFile, 'wb') as fw:
    pickle.dump(dataset, fw)
load data:
import pickle
inputFile = 'test.data'
with open(inputFile, 'rb') as fd:
    dataset = pickle.load(fd)
print(dataset)
You can save the data in a database such as SQLite, or in a plain .txt file. For example:
with open("mylist.txt", "w") as f:  # in write mode
    f.write("{}".format(mylist))
Your list goes into the format() function. This creates a .txt file named mylist and saves your list data into it.
After that, when you want to access your data again, you can do:
with open("mylist.txt") as f:  # in read mode, not in write mode, careful
    rd = f.readlines()
print(rd)
Note that readlines() returns the file's lines as strings, not the original list object.
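To get an actual list back from such a text file, the stored text has to be parsed again. A minimal sketch using ast.literal_eval, assuming the list only holds simple literals (strings, numbers, nested lists and so on); the sample list and temporary path are made up for illustration:

```python
import ast
import os
import tempfile

# Hypothetical list to persist between program runs.
mylist = ["hello", "test", 42]

# Illustrative file location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "mylist.txt")

# Save the list as its Python literal representation.
with open(path, "w") as f:
    f.write(repr(mylist))

# Read the text back and safely evaluate it as a literal.
with open(path) as f:
    restored = ast.literal_eval(f.read())
```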
The built-in pickle module provides some basic functionality for serialization, which is a term for turning arbitrary objects into something suitable to be written to disk. Check out the docs for Python 2 or Python 3.
Pickle isn't very robust though, and for more complex data you'll likely want to look into a database module like the built-in sqlite3 or a full-fledged object-relational mapping (ORM) like SQLAlchemy.
For storing large amounts of data, the HDF5 format is suitable. In Python it is available through the h5py package.
My program creates a probabilistic model that I want to save as a module to import later. How can I save it in a way that it can be directly imported?
JSON is good for dicts, but I have different data structures. Pickle does not seem to allow importing the result directly, and pprint does not print the names and assignments of the structures.
I would just like to create some data structures:
states = (
    'Bound',
    'Not-bound'
)
Prob = {
    'Bound': 0.45,
    'Not-bound': 0.55
}
save them somehow to a 'py' file:
with open('model.py', 'wb') as out:
    save(states)
    save(Prob)
Then, import them later directly:
import model
print(model.states)
Take a look at the pickle module.
The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
It won't be quite the way you want it to be but I think it's a simple and reasonable way of doing what you want.
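If an importable file really is the goal, one alternative to pickle is to write assignments using repr(). This is a sketch under the assumption that every structure consists of built-in literals (tuples, dicts, strings, numbers); the temporary directory and the importlib loading step are only there to keep the example self-contained:

```python
import importlib.util
import os
import tempfile

states = ("Bound", "Not-bound")
Prob = {"Bound": 0.45, "Not-bound": 0.55}

# Illustrative location; any directory on the import path would do.
path = os.path.join(tempfile.mkdtemp(), "model.py")

# Write one "name = literal" assignment per structure.
with open(path, "w") as out:
    for name, value in (("states", states), ("Prob", Prob)):
        out.write(f"{name} = {value!r}\n")

# Load the generated file as a module named "model".
spec = importlib.util.spec_from_file_location("model", path)
model = importlib.util.module_from_spec(spec)
spec.loader.exec_module(model)
print(model.states)
```

When model.py sits in the current directory or elsewhere on sys.path, a plain `import model` should work in place of the importlib calls.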
I have a program where I basically adjust the probability of certain things happening based on what is already known. My file of data is already saved as a pickle Dictionary object at Dictionary.txt.
The problem is that every time I run the program it pulls in Dictionary.txt, turns it into a dictionary object, makes its edits and overwrites Dictionary.txt. This is pretty memory intensive, as Dictionary.txt is 123 MB. When I dump, I get a MemoryError; everything seems fine when I pull it in.
Is there a better (more efficient) way of doing the edits? (Perhaps without having to overwrite the entire file every time.)
Is there a way that I can invoke garbage collection (through gc module)? (I already have it auto-enabled via gc.enable())
I know that besides readlines() you can read line by line. Is there a way to edit the dictionary incrementally, line by line, when I already have a fully completed Dictionary object file in the program?
Any other solutions?
Thank you for your time.
I was having the same issue. I used joblib and it did the job, in case someone wants to know about other possibilities.
# save the model to disk
import joblib  # older scikit-learn releases exposed this as sklearn.externals.joblib
filename = 'finalized_model.sav'
joblib.dump(model, filename)
# some time later... load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)
I am the author of a package called klepto (and also the author of dill).
klepto is built to store and retrieve objects in a very simple way, and provides a simple dictionary interface to databases, memory cache, and storage on disk. Below, I show storing large objects in a "directory archive", which is a filesystem directory with one file per entry. I choose to serialize the objects (it's slower, but uses dill, so you can store almost any object), and I choose a cache. Using a memory cache enables me to have fast access to the directory archive, without having to have the entire archive in memory. Interacting with a database or file can be slow, but interacting with memory is fast… and you can populate the memory cache as you like from the archive.
>>> import klepto
>>> d = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> d
dir_archive('stuff', {}, cached=True)
>>> import numpy
>>> # add three entries to the memory cache
>>> d['big1'] = numpy.arange(1000)
>>> d['big2'] = numpy.arange(1000)
>>> d['big3'] = numpy.arange(1000)
>>> # dump from memory cache to the on-disk archive
>>> d.dump()
>>> # clear the memory cache
>>> d.clear()
>>> d
dir_archive('stuff', {}, cached=True)
>>> # only load one entry to the cache from the archive
>>> d.load('big1')
>>> d['big1'][-3:]
array([997, 998, 999])
>>>
klepto provides fast and flexible access to large amounts of storage, and if the archive allows parallel access (e.g. some databases) then you can read results in parallel. It's also easy to share results in different parallel processes or on different machines. Here I create a second archive instance, pointed at the same directory archive. It's easy to pass keys between the two objects, and works no differently from a different process.
>>> f = klepto.archives.dir_archive('stuff', cached=True, serialized=True)
>>> f
dir_archive('stuff', {}, cached=True)
>>> # add some small objects to the first cache
>>> d['small1'] = lambda x:x**2
>>> d['small2'] = (1,2,3)
>>> # dump the objects to the archive
>>> d.dump()
>>> # load one of the small objects to the second cache
>>> f.load('small2')
>>> f
dir_archive('stuff', {'small2': (1, 2, 3)}, cached=True)
You can also pick from various levels of file compression, and whether
you want the files to be memory-mapped. There are a lot of different
options, both for file backends and database backends. The interface
is identical, however.
With regard to your other questions about garbage collection and editing of portions of the dictionary, both are possible with klepto, as you can individually load and remove objects from the memory cache, dump, load, and synchronize with the archive backend, or any of the other dictionary methods.
See a longer tutorial here: https://github.com/mmckerns/tlkklp
Get klepto here: https://github.com/uqfoundation
None of the above answers worked for me. I ended up using Hickle, which is a drop-in replacement for pickle based on HDF5. Instead of saving the data to a pickle, it saves the data to an HDF5 file. The API is identical for most use cases and it has some really cool features such as compression.
pip install hickle
Example:
import hickle as hkl
import numpy as np
# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')
# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')
# Load data
array_hkl = hkl.load('test.hkl')
I had a memory error and resolved it by using protocol=2:
cPickle.dump(obj, file, protocol=2)
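On Python 3, cPickle has been folded into pickle and the default protocol is already a binary one. The sketch below compares the text protocol 0 with the highest available binary protocol on a made-up list; whether a binary protocol avoids a MemoryError on a particular machine is not something this example can show:

```python
import pickle

# Hypothetical data: a large list of integers.
data = list(range(100000))

# Protocol 0 is the original text-based format.
text_pickle = pickle.dumps(data, protocol=0)

# Binary protocols encode the same data more compactly.
binary_pickle = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(len(text_pickle), len(binary_pickle))

restored = pickle.loads(binary_pickle)
```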
If your keys and values are strings, you can use one of the embedded persistent key-value storage engines available in the Python standard library. Example from the anydbm module docs:
import anydbm
# Open database, creating it if necessary.
db = anydbm.open('cache', 'c')
# Record some values
db['www.python.org'] = 'Python Website'
db['www.cnn.com'] = 'Cable News Network'
# Loop through contents. Other dictionary methods
# such as .keys(), .values() also work.
for k, v in db.iteritems():
    print k, '\t', v
# Storing a non-string key or value will raise an exception (most
# likely a TypeError).
db['www.yahoo.com'] = 4
# Close when done.
db.close()
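The anydbm module is Python 2 only; on Python 3 it was renamed dbm. A rough equivalent is sketched below, with a temporary path chosen purely for illustration. It also shows that a single key can be changed without loading or rewriting the other entries:

```python
import dbm
import os
import tempfile

# Illustrative database location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "cache")

# Open the database, creating it if necessary, and record some values.
with dbm.open(path, "c") as db:
    db["www.python.org"] = "Python Website"
    db["www.cnn.com"] = "Cable News Network"

# Reopen later and update one entry; the rest stays on disk untouched.
with dbm.open(path, "c") as db:
    db["www.cnn.com"] = "CNN"
    value = db["www.python.org"]
    updated = db["www.cnn.com"]

# Keys and values come back as bytes on Python 3.
print(value, updated)
```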
Have you tried using streaming pickle? https://code.google.com/p/streaming-pickle/
I have just solved a similar memory error by switching to streaming pickle.
How about this?
import cPickle as pickle
p = pickle.Pickler(open("temp.p","wb"))
p.fast = True
p.dump(d) # d could be your dictionary or any other picklable object
I recently had this problem. After trying cPickle with ASCII and the binary protocol 2, I found that my SVM from scikit-learn, trained on 20+ GB of data, was not pickling due to a memory error. However, the dill package seemed to resolve the issue. Dill will not bring many improvements for a dictionary but may help with streaming. It is meant to stream pickled bytes across a network.
import dill
# obj is the object to serialize; path is the destination file
with open(path, 'wb') as fp:
    dill.dump(obj, fp)
# reopen the file in read mode to load the object back
with open(path, 'rb') as fp:
    obj = dill.load(fp)
If efficiency is an issue, try loading/saving to a database. In this instance, your storage solution may be the issue. At 123 MB, Pandas should be fine. However, if the machine has limited memory, SQL offers fast, optimized bag operations over data, usually with multithreaded support.
My poly kernel svm saved.
This may seem trivial, but try to use the 64bit Python if you are not.
I tried the following solutions, but none of them resolved my problem:
Using hickle to replace pickle
Using joblib to replace pickle
Using sklearn.externals joblib to replace pickle
Changing the pickle mode
Here is a different fix for this issue:
Finally, I found that the root cause was that the working directory path was too long.
So I changed the directory to one with a very short path.
Enjoy it.
Is there a way to parse CSV data in Python when the data is not in a file? I'm storing CSV data in my database and I'd like to parse it. I'm looking for something analogous to Ruby's CSV.parse. I know Python has a CSV class but everything I've seen in the docs seems to deal with files as opposed to in-memory CSV data.
(And it's not an option to parse the data before it goes into the database.)
(And please don't tell me not to store the CSV data in the database. I know what I'm doing as far as the database goes.)
The Python csv module makes no special distinction for files. You can use StringIO to wrap your strings as file-like objects.
Here is why you should use cStringIO.StringIO (io.StringIO in Python 3.x) instead of some DIY kludge:
>>> import csv
>>> from cStringIO import StringIO
>>> fromDB = '"Column\nheading1",hdng2\r\n"data1\rfoo","data2\r\nfoo"\r\n'
>>> sources = [StringIO(fromDB), fromDB.splitlines(True),
... fromDB.splitlines(), fromDB.split("\n")]
>>> for i, source in enumerate(sources):
...     print i, list(csv.reader(source))
...
0 [['Column\nheading1', 'hdng2'], ['data1\rfoo', 'data2\r\nfoo']] # OK
1 [['Column\nheading1', 'hdng2'], ['data1\rfoo', 'data2\r\nfoo']] # OK
2 [['Columnheading1', 'hdng2'], ['data1foo', 'data2foo']] # 3 errors
3 [['Columnheading1', 'hdng2'], ['data1\rfoo', 'data2\rfoo'], []] # 3 errors
>>>
Using guff.splitlines(True) is not recommended as it has a far greater chance than StringIO(guff) that whoever is reading your code will not have a clue what it does.
Use the StringIO module, which lets you dress strings up as file-like objects. That way you can pass a StringIO "file" to the csv module for parsing (or any other parser you may be using).
http://docs.python.org/library/csv.html
csv.reader(csvfile)
csvfile can be any object which
supports the iterator protocol and
returns a string each time its next()
method is called — file objects and
list objects are both suitable.
If you have e.g. the content from DB in a string you can parse it like
import csv
fromDB = "1,2,3\n4,5,6"
reader = csv.reader(fromDB.split("\n"))
for row in reader:
    print("New row")
    for col in row:
        print(" ", col)
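On Python 3, the StringIO approach recommended earlier in this thread could be sketched as follows. The sample string is invented for illustration and contains a quoted field with an embedded newline, which a plain split("\n") would break apart:

```python
import csv
import io

# Hypothetical CSV text as it might come back from the database.
from_db = 'name,note\r\n"a","line1\nline2"\r\n'

# newline="" leaves line endings untouched so the csv module can interpret them.
rows = list(csv.reader(io.StringIO(from_db, newline="")))
print(rows)
```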