Save Numpy Array using Pickle - python

I've got a Numpy array (130,000 x 3) that I would like to save using Pickle, with the following code. However, I keep getting the error "EOFError: Ran out of input" or "UnsupportedOperation: read" at the pkl.load line. This is my first time using Pickle, any ideas?
Thanks,
Anant
import pickle as pkl
import numpy as np

arrayInput = np.zeros((1000,2)) #Trial input
save = True
load = True

filename = path + 'CNN_Input'
fileObject = open(fileName, 'wb')

if save:
    pkl.dump(arrayInput, fileObject)
    fileObject.close()

if load:
    fileObject2 = open(fileName, 'wb')
    modelInput = pkl.load(fileObject2)
    fileObject2.close()

if arrayInput == modelInput:
    Print(True)

You should use numpy.save and numpy.load.

I have no problems using pickle:
In [126]: arr = np.zeros((1000,2))
In [127]: with open('test.pkl','wb') as f:
     ...:     pickle.dump(arr, f)
     ...:
In [128]: with open('test.pkl','rb') as f:
     ...:     x = pickle.load(f)
     ...:     print(x.shape)
     ...:
(1000, 2)
pickle and np.save/load have a deep reciprocity. For example, I can load this pickle with np.load (on newer NumPy versions you will need allow_pickle=True):
In [129]: np.load('test.pkl').shape
Out[129]: (1000, 2)
If I open the pickle file in the wrong mode, I do get your error:
In [130]: with open('test.pkl','wb') as f:
     ...:     x = pickle.load(f)
     ...:     print(x.shape)
     ...:
UnsupportedOperation: read
But that shouldn't be surprising - you can't read a freshly opened write file. It will be empty.
np.save/load is the usual pair for writing numpy arrays. But pickle uses save to serialize arrays, and save uses pickle to serialize non-array objects (inside the array). The resulting file sizes are similar. Curiously, in my timings the pickle version is faster.
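To make the size and timing claims concrete, here is a rough sketch (file names are made up, and the exact numbers will vary by machine and array contents):
import os
import pickle
import time
import numpy as np

arr = np.zeros((130000, 3))   # roughly the shape from the question

t0 = time.perf_counter()
np.save('arr.npy', arr)       # numpy's native .npy format
t_npy = time.perf_counter() - t0

t0 = time.perf_counter()
with open('arr.pkl', 'wb') as f:
    pickle.dump(arr, f)       # plain pickle of the same array
t_pkl = time.perf_counter() - t0

print('npy:', os.path.getsize('arr.npy'), 'bytes in', t_npy, 's')
print('pkl:', os.path.getsize('arr.pkl'), 'bytes in', t_pkl, 's')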

It's been a while, but if you're finding this: Pickle completes in a fraction of the time that numpy's compressed save takes.
with open('filename','wb') as f: pickle.dump(arrayname, f)
with open('filename','rb') as f: arrayname1 = pickle.load(f)
numpy.array_equal(arrayname,arrayname1) #sanity check
On the other hand, numpy's compressed save (np.savez_compressed with default settings) took my 5.2 GB down to 0.4 GB, while Pickle produced 1.7 GB.
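For comparison, a minimal sketch of the two approaches being described (np.savez_compressed vs a plain pickle); note that the compression ratio depends entirely on how compressible your data is, so random data will barely shrink:
import os
import pickle
import numpy as np

arr = np.random.rand(1000, 1000)

np.savez_compressed('arr_compressed.npz', arr=arr)   # compressed .npz archive
with open('arr.pkl', 'wb') as f:
    pickle.dump(arr, f)                              # uncompressed pickle

print('npz:', os.path.getsize('arr_compressed.npz'))
print('pkl:', os.path.getsize('arr.pkl'))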

Don't use pickle for numpy arrays; for an extended discussion that links to all the resources I could find, see my answer here.
Short reasons:
there is already a nice interface that the developers of numpy made, and it will save you lots of debugging time (most important reason)
np.save, np.load, np.savez have pretty good performance in most metrics (see this), which is to be expected since it's an established library and the developers of numpy made those functions.
Pickle executes arbitrary code and is a security issue
to use pickle you would have to open a file, and you might run into issues that lead to bugs (e.g. I wasn't aware of needing the b flag and it stopped working; it took time to debug)
if you refuse to accept this advice, at least really articulate the reason you need to use something else. Make sure it's crystal clear in your head.
Avoid repeating code at all costs if a solution already exists!
Anyway, here are all the interfaces I tried, hopefully it saves someone time (probably my future self):
import numpy as np
import pickle
from pathlib import Path

path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)

lb, ub = -1, 1
num_samples = 5
x = np.random.uniform(low=lb, high=ub, size=(1, num_samples))
y = x**2 + x + 2

# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
    pickle.dump(obj={'x': x, 'y': y}, file=db_file)

## loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
    db_pkl = pickle.load(db_file)

print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')
But most useful of all, see my answer here.

Here is one more possible way. Sometimes you should add the extra protocol option. For example,
import pickle
import numpy as np

# Your array
arrayInput = np.zeros((1000,2))
Here is your approach:
pickle.dump(arrayInput, open('file_name.pickle', 'wb'))
Which you can change to:
# in two lines of code
with open("file_name.pickle", "wb") as f:
    pickle.dump(arrayInput, f, protocol=pickle.HIGHEST_PROTOCOL)
or
# Save in line of code
pickle.dump(arrayInput, open("file_name.pickle", "wb"), protocol=pickle.HIGHEST_PROTOCOL)
Afterwards, you can easily read your numpy array back, for example:
arrayInput = pickle.load(open("file_name.pickle", 'rb'))
Hope it is useful for you.

Many are forgetting one very important thing: security.
Pickled data is effectively a stream of instructions that gets executed immediately when you call pickle.load. If loading from an untrusted source, the file could contain instructions that run arbitrary code, enabling things like man-in-the-middle attacks over a network, among other things (e.g. see this realpython.com article).
Pure pickled data may be faster to save/load if you don't follow it with bz2 compression (at the cost of a larger file size), but numpy load/save may be more secure.
Alternatively, you may save pickled data along with a keyed hash (HMAC) computed from a secret key using the builtin hashlib and hmac libraries and, prior to loading, compare the recomputed hash against the one you stored:
import hashlib
import hmac


def calculate_hash(
    key_,
    file_path,
    hash_=hashlib.sha256
):
    with open(file_path, "rb") as fp:
        file_hash = hmac.new(key_, fp.read(), hash_).hexdigest()

    return file_hash


def compare_hash(
    hash1,
    hash2,
):
    """
    Warning:
        Do not use `==` directly to compare hash values. Timing attacks can be used
        to learn your security key. Use ``compare_digest()``.
    """

    return hmac.compare_digest(hash1, hash2)
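A possible way to tie these helpers together (a sketch; the key and file names here are made up, and the trusted hash would normally be stored somewhere an attacker cannot modify):
import pickle
import numpy as np

secret_key = b'replace-with-a-real-secret-key'   # hypothetical key
data_path = 'arr.pkl'                            # hypothetical file name

arr = np.zeros((1000, 2))
with open(data_path, 'wb') as f:
    pickle.dump(arr, f)

# Compute the HMAC right after writing and store it somewhere trusted.
trusted_hash = calculate_hash(secret_key, data_path)

# Later, before unpickling, recompute and compare in constant time.
if compare_hash(trusted_hash, calculate_hash(secret_key, data_path)):
    with open(data_path, 'rb') as f:
        arr_loaded = pickle.load(f)
else:
    raise ValueError('pickle file failed the integrity check; refusing to load')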
In a corporate setting, always be sure to confirm with your IT department. You want to be sure proper authentication, encryption, and authorization is all "set to go" when loading and saving data over servers and networks.
Pickle/CPickle
If you are confident you are using nothing but trusted sources and speed is a major concern over security and file size, pickle might be the way to go. In addition, you can take a few extra security measures using cPickle (this may have been incorporated directly into pickle in recent Python3 versions, but I'm not sure, so always double-check):
Use a cPickle.Unpickler instance, and set its "find_global" attribute to None to disable importing any modules (thus restricting loading to builtin types such as dict, int, list, string, etc).
Use a cPickle.Unpickler instance, and set its "find_global" attribute to a function that only allows importing of modules and names from a whitelist (a Python 3 sketch of this idea is shown after this list).
Use something like the itsdangerous package to authenticate the data before unpickling it if you're loading it from an untrusted source.
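As a rough Python 3 counterpart of the whitelist idea above (a sketch, not part of the original answer): modern pickle exposes this hook as the Unpickler.find_class method rather than cPickle's find_global attribute, so you can subclass Unpickler and reject anything outside a small whitelist.
import builtins
import io
import pickle

SAFE_BUILTINS = {'dict', 'list', 'set', 'tuple', 'str', 'int', 'float', 'bool'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow a handful of harmless builtins; refuse everything else.
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f'global {module}.{name} is forbidden')

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

print(restricted_loads(pickle.dumps({'a': [1, 2, 3]})))  # plain containers load fine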
Numpy
If you are only saving numpy data and no other python data, and security is a greater priority over file size and speed, then numpy might be the way to go.
HDF5/H5PY
If your data is truly large and complex, hdf5 format via h5py is good.
JSON
And of course, this discussion wouldn't be complete without mentioning json. You may need to do extra work setting up encoding and decoding of your data, but nothing gets immediately run when you use json.load, so you can check the template/structure of the loaded data before you use it.
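For instance, a minimal sketch of the json route (the file name and the shape/data layout are just illustrative): numpy arrays have to be converted to plain lists first, and nothing is executed when the file is read back.
import json
import numpy as np

arr = np.zeros((1000, 2))

# json can't serialize ndarrays directly, so convert to nested lists first.
with open('arr.json', 'w') as f:
    json.dump({'shape': list(arr.shape), 'data': arr.tolist()}, f)

# Nothing is executed on load; inspect the structure before trusting it.
with open('arr.json') as f:
    payload = json.load(f)

arr_loaded = np.array(payload['data'])
print(arr_loaded.shape)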
DISCLAIMER: I take no responsibility for end-user security with this provided information. The above information is for informational purposes only. Please use proper discretion and appropriate measures (including corporate policies, where applicable) with regard to security needs.

You should use numpy.save() for saving numpy matrices.

In your code, you're using:
if load:
    fileObject2 = open(fileName, 'wb')
    modelInput = pkl.load(fileObject2)
    fileObject2.close()
The second argument of the open function is the mode: w stands for writing, r for reading. The second character b denotes that bytes will be read/written. A file opened for writing cannot be read, and vice versa. Therefore, opening the file with fileObject2 = open(fileName, 'rb') will do the trick.
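Putting that together, a corrected round trip of the question's script might look like this (a sketch; the path prefix from the question is omitted):
import pickle as pkl
import numpy as np

arrayInput = np.zeros((1000, 2))  # trial input
fileName = 'CNN_Input'            # path prefix from the question omitted here

with open(fileName, 'wb') as fileObject:    # write mode for dumping
    pkl.dump(arrayInput, fileObject)

with open(fileName, 'rb') as fileObject2:   # read mode for loading
    modelInput = pkl.load(fileObject2)

# arrayInput == modelInput is element-wise; array_equal gives a single bool
print(np.array_equal(arrayInput, modelInput))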

The easiest way to save and load a NumPy array -
# a numpy array
result.importances_mean
array([-1.43651529e-03, -2.73401297e-03, 9.26784059e-05, -7.41427247e-04,
3.56811863e-03, 2.78035218e-03, 3.70713624e-03, 5.51436515e-03,
1.16821131e-01, 9.26784059e-05, 9.26784059e-04, -1.80722892e-03,
-1.71455051e-03, -1.29749768e-03, -9.26784059e-05, -1.43651529e-03,
0.00000000e+00, -1.11214087e-03, -4.63392030e-05, -4.63392030e-04,
1.20481928e-03, 5.42168675e-03, -5.56070436e-04, 8.34105653e-04,
-1.85356812e-04, 0.00000000e+00, -9.73123262e-04, -1.43651529e-03,
-1.76088971e-03])
# save the array format - np.save(filename.npy, array)
np.save(os.path.join(model_path, "permutation_imp.npy"), result.importances_mean)
# load the array format - np.load(filename.npy)
res = np.load(os.path.join(model_path, "permutation_imp.npy"))
res
array([-1.43651529e-03, -2.73401297e-03, 9.26784059e-05, -7.41427247e-04,
3.56811863e-03, 2.78035218e-03, 3.70713624e-03, 5.51436515e-03,
1.16821131e-01, 9.26784059e-05, 9.26784059e-04, -1.80722892e-03,
-1.71455051e-03, -1.29749768e-03, -9.26784059e-05, -1.43651529e-03,
0.00000000e+00, -1.11214087e-03, -4.63392030e-05, -4.63392030e-04,
1.20481928e-03, 5.42168675e-03, -5.56070436e-04, 8.34105653e-04,
-1.85356812e-04, 0.00000000e+00, -9.73123262e-04, -1.43651529e-03,
-1.76088971e-03])

Related

Allow_Pickle = True modified my dictionary to "unsized" when loaded

I am trying to save and load variables (dictionaries) to use in other notebooks. I save the variables with:
with open('opp2b.npy', 'wb') as f:
    np.save(f, mak)
    np.save(f, mp)

len(mak)
82
mak and mp are dictionaries with 82 entries of the same length.
When loading, if I don't use allow_pickle=True it will not load. So I use this:
with open('opp2b.npy', 'rb') as f:
    mak = np.load(f, allow_pickle=True)
    mp = np.load(f, allow_pickle=True)
and when I check the length I get
len(mak)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-bb967ce1f5ef> in <module>
----> 1 len(mak)

TypeError: len() of unsized object
I am not sure why the array is modified, but it is now unusable for what I need.
Per your comments, mak is not a numpy array at all. numpy.save is specifically documented to:
Save an array to a binary file in NumPy .npy format.
allow_pickle is for numpy arrays containing Python objects, but the .npy format is not intended to store things that aren't numpy arrays at all. To successfully store the dict, it's wrapping it in a 0D numpy "array", and that's what np.load is giving you. You could extract the original dict by doing:
mak = mak.item(0)  # mak = mak[0] doesn't work, and I'm unclear on why .item(0) works,
                   # as the docs claim the only difference is that .item(0) returns
                   # a Python scalar, rather than a numpy scalar, and that's not an
                   # issue here, but I assume something about 0D arrays requires this
But really, that's trying to put a square peg in a round hole. If you're not storing numpy arrays, there's little benefit to the .npy format, if any. The main advantages it provides are:
Avoiding arbitrary code execution for untrusted inputs (since you need to allow_pickle, that advantage goes away)
Allowing you to memory map on load (irrelevant when the entire data structure must be pickled anyway; memory mapping helps only for C level data where you might benefit from lazy loading and better performance if RAM grows short, as the data need not be written to swap before the pages are reclaimed)
(No longer relevant on modern Python) Stores array data more efficiently than the old pickle protocol 0 (that produced legal ASCII output, meaning only bytes of 127 or below, which made pickling raw binary data inefficient). As long as you're using protocol 2 or higher (which is binary, handles new-style classes efficiently, and is supported back to Python 2.3), it should store your data efficiently. As of Python 3.0, the default protocol is protocol 3 (rising to protocol 4 in 3.8), so if you're using a supported version of Python, and don't specify the protocol, it will use 3 or 4 (both of which work fine; protocol 4 being better if you're pickling huge objects).
Since you aren't storing numpy arrays, just rely on the pickle module directly to store arbitrary data (for modern pickle protocols, which allow efficient binary storage, numpy stores efficiently enough anyway, so the .npy format isn't helping much, if at all; for some trivial test cases I tried, saving {'a': numpy.array([0,1,2])}, the .npy dump was over twice the size).
import pickle  # At top of file

with open('opp2b.pkl', 'wb') as f:  # Name with common pickle extension instead of .npy
    pickle.dump(mak, f)             # Argument order reversed from np.save
    pickle.dump(mp, f)
and then to load:
with open('opp2b.pkl', 'rb') as f:  # Matching change in name
    mak = pickle.load(f)
    mp = pickle.load(f)
This assumes you might in fact want to load only one data set or the other at a time; if you plan to store and load both all the time, you may as well condense it to a single write of a tuple of the relevant values (increasing the chance that duplicated objects across the two objects can use back-references to avoid reserializing the same data multiple times), e.g.:
with open('opp2b.pkl', 'wb') as f:
    pickle.dump((mak, mp), f)
and:
with open('opp2b.pkl', 'rb') as f:
    mak, mp = pickle.load(f)

Speed up reading multiple pickle files

I have a lot of pickle files. Currently I read them in a loop but it takes a lot of time. I would like to speed it up but don't have any idea how to do that.
Multiprocessing wouldn't work because, in order to transfer data from a child process to the main process, the data needs to be serialized (pickled) and deserialized.
Using threading wouldn't help either because of the GIL.
I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?
UPDATE
Answering your questions:
Files are partial products of data processing for the purpose of ML
There are pandas.Series objects but the dtype is not known upfront
I want to have many files because we want to pick any subset easily
I want to have many smaller files instead of one big file because deserialization of one big file takes more memory (at some point in time we have serialized string and deserialized objects)
The size of the files can vary a lot
I use python 3.7 so I believe it's cPickle in fact
Using pickle is very flexible because I don't have to worry about underlying types - I can save anything
I agree with what has been noted in the comments, namely that due to the constraints of Python itself (chiefly the GIL, as you noted), there may simply be no faster way of loading the information than what you are doing now. Or, if there is a way, it may be both highly technical and, in the end, only give you a modest increase in speed.
That said, depending on the datatypes you have, it may be faster to use quickle or pyrobuf.
I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?
In short: no. pickle is apparently good enough for enough people that there are no major alternate implementations fully compatible with the pickle protocol. As of sometime in python 3, cPickle was merged with pickle, and neither release the GIL anyway which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).
If your data can be re-structured into a simpler data format like csv, or a binary format like numpy's npy, there will be less cpu overhead when reading your data. Pickle is built for flexibility first rather than speed or compactness. One possible exception to the "more complex, less speed" rule is the HDF5 format via h5py, which can be fairly complex, and which I have used to max out the bandwidth of a SATA SSD.
Finally you mention you have many many pickle files, and that itself is probably causing no small amount of overhead. Each time you open a new file, there's some overhead involved from the operating system. Conveniently you can combine pickle files by simply appending them together. Then you can call Unpickler.load() until you reach the end of the file. Here's a quick example of combining two pickle files together using shutil
import pickle, shutil, os

# some dummy data
d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}

# create two pickles
with open('test1.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d1)

with open('test2.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d2)

# combine list of pickle files
with open('test3.pickle', 'wb') as dst:
    for pickle_file in ['test1.pickle', 'test2.pickle']:
        with open(pickle_file, 'rb') as src:
            shutil.copyfileobj(src, dst)

# unpack the data
with open('test3.pickle', 'rb') as f:
    p = pickle.Unpickler(f)
    while True:
        try:
            print(p.load())
        except EOFError:
            break

# cleanup
os.remove('test1.pickle')
os.remove('test2.pickle')
os.remove('test3.pickle')
I think you should try using mmap (memory-mapped files), which is similar to open() but much faster.
Note: if each of your files is big, use mmap; if the files are small, use the regular methods.
I have written a sample that you can try.
import mmap
import pickle
from time import perf_counter as pf

def load_files(filelist):
    start = pf()  # for rough time calculations
    for filename in filelist:
        with open(filename, mode="r", encoding="utf8") as file_obj:
            with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_file_obj:
                data = pickle.load(mmap_file_obj)
                print(data)
    print(f'Operation took {pf()-start} sec(s)')
Here mmap.ACCESS_READ maps the file for read-only access. The file_obj returned by open is just used to get the file descriptor, which mmap uses to open a memory-mapped stream to the file.
As you can see below in the documentation, os.open returns the file descriptor, or fd for short, so we don't have to do anything with file_obj operation-wise; we just need its fileno() method to get its file descriptor. Also note that we are not closing file_obj before mmap_file_obj; take a proper look, we close the mmap block first.
As you said in your comment:
open (file, flags[, mode])
Open the file file and set various flags according to flags and possibly its mode according to mode.
The default mode is 0777 (octal), and the current umask value is first masked out.
Return the file descriptor for the newly opened file.
Give it a try and see how much impact it has on your operation.
You can read more about mmap here. And about file descriptor here
You can try multiprocessing:
import os, pickle
from multiprocessing import Pool

pickle_list = os.listdir("pickles")
output_dict = dict.fromkeys(pickle_list, '')

def pickle_process_func(picklename):
    with open("pickles/" + picklename, 'rb') as file:
        dapickle = pickle.load(file)

    # if you need the previous file's output, wait for it
    while not output_dict[pickle_list[pickle_list.index(picklename) - 1]]:
        continue

    # then do something
    print("loaded")
    output_dict[picklename] = custom_func_i_dunno(dapickle)

with Pool(processes=10) as pool:
    pool.map(pickle_process_func, pickle_list)
Consider using HDF5 via h5py instead of pickle. The performance is generally much better than pickle with numerical data in Pandas and numpy data structures and it supports most common data types and compression.
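A minimal h5py sketch of that suggestion (the file and dataset names are made up):
import h5py
import numpy as np

arr = np.random.rand(1000, 2)

# Write one dataset with gzip compression into a single .h5 file.
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('arr', data=arr, compression='gzip')

# Read it back; [...] pulls the whole dataset into memory as an ndarray.
with h5py.File('data.h5', 'r') as f:
    arr_loaded = f['arr'][...]

print(np.array_equal(arr, arr_loaded))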

Python hangs silently on large file write

I am trying to write a big list of numpy nd_arrays to disk.
The list is ~50000 elements long
Each element is a nd_array of size (~2048,2) of ints. The arrays have different shapes.
The method I am (currently) using is:
@staticmethod
def _write_with_yaml(path, obj):
    with io.open(path, 'w+', encoding='utf8') as outfile:
        yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)
I have also tried pickle, which gives the same problem:
On small lists (~3400 long), this works fine, finishes fast enough (<30 sec).
On ~6000 long lists, this finishes after ~2 minutes.
When the list gets larger, the process seems not to do anything. No change in RAM or disk activity.
I stopped waiting after 30 minutes.
After force stopping the process, the file suddenly became of significant size (~600MB).
I can't know if it finished writing or not.
What is the correct way to write such large lists, know if the write succeeded, and, if possible, know when the write/read is going to finish?
How can I debug what's happening when the process seems to hang?
I prefer not to break and assemble the lists manually in my code, I expect the serialization libraries to be able to do that for me.
For the code
import numpy as np
import yaml

x = []
for i in range(0, 50000):
    x.append(np.random.rand(2048, 2))

print("Arrays generated")

with open("t.yaml", 'w+', encoding='utf8') as outfile:
    yaml.dump(x, outfile, default_flow_style=False, allow_unicode=True)
on my system (MacOSX, i7, 16 GiB RAM, SSD) with Python 3.7 and PyYAML 3.13 the finish time is 61 min. During the save the python process occupied around 5 GB of memory and the final file size is 2 GB. This also shows the overhead of the file format: since the size of the data is 50k * 2048 * 2 * 8 (the size of a float is generally 64 bits in python) = 1562 MB, yaml is around 1.3 times worse (and serialisation/deserialisation also takes time).
To answer your questions:
There is no correct or incorrect way. Having a progress update and an estimation of the finishing time is not easy (e.g. other tasks might interfere with the estimation, resources like memory could be used up, etc.). You can rely on a library that supports that or implement something yourself (as the other answer suggested).
Not sure "debug" is the correct term, as in practice it might be that the process is just slow. Doing a performance analysis is not easy, especially when using multiple/different libraries. What I would start with is clear requirements: what do you want from the saved file? Does it need to be yaml? Saving 50k arrays as yaml does not seem the best solution if you care about performance. Should you ask yourself first "which is the best format for what I want?" (but you did not give details, so I can't say...)
Edit: if you want something just fast, use pickle. The code:
import numpy as np
import yaml
import pickle

x = []
for i in range(0, 50000):
    x.append(np.random.rand(2048, 2))

print("Arrays generated")

pickle.dump(x, open("t.yaml", "wb"))
finishes in 9 seconds, and generates a file of 1.5GBytes (no overhead). Of course pickle format should be used in very different circumstances than yaml...
I can't say this is the answer, but it may be.
When I was working on an app that required fast cycles, I found out that something in the code was very slow. It was opening/closing yaml files.
It was solved by using JSON.
Don't use YAML for anything other than configs you don't open often.
Solution for saving your arrays:
np.save(path, array)  # path = path + name + '.npy'
If you really need to save a list of arrays, I recommend saving a list of array paths (the arrays themselves you save to disk with np.save). Saving Python objects on disk is not really what you want. What you want is to save numpy arrays with np.save.
Complete solution (saving example):
for array_index in range(len(list_of_arrays)):
    np.save(str(array_index) + '.npy', list_of_arrays[array_index])
    # path = str(array_index) + '.npy'
Complete solution (loading example):
list_of_array_paths = ['1.npy', '2.npy']
list_of_arrays = []
for array_path in list_of_array_paths:
    list_of_arrays.append(np.load(array_path))
Further advice:
Python can't really handle large arrays well, especially if you have loaded several of them into a list. From the point of view of speed and memory, always work with one or two arrays at a time; the rest should wait on disk. So instead of keeping object references, keep paths, and load each array from disk when needed.
Also, you said you don't want to assemble the list manually.
A possible solution, which I don't advise, but which is possibly exactly what you are looking for:
>>> a = np.zeros(shape = [10,5,3])
>>> b = np.zeros(shape = [7,7,9])
>>> c = [a,b]
>>> np.save('data.npy',c)
>>> d = np.load('data.npy')
>>> d.shape
(2,)
>>> type(d)
<type 'numpy.ndarray'>
>>> d.shape
(2,)
>>> d[0].shape
(10, 5, 3)
>>>
I believe I don't need to comment on the above code. However, after loading it back, you will lose the list, as it will have been transformed into a numpy array (an object array, which on recent NumPy versions also requires np.load('data.npy', allow_pickle=True) to load).
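If you do want several differently-shaped arrays in one file without going through an object array, np.savez is arguably the cleaner alternative (a sketch, not part of the original answer):
import numpy as np

a = np.zeros((10, 5, 3))
b = np.zeros((7, 7, 9))

np.savez('data.npz', a=a, b=b)   # each array keeps its own shape and dtype

loaded = np.load('data.npz')     # lazy archive, indexed by the keyword names
print(loaded['a'].shape)         # (10, 5, 3)
print(loaded['b'].shape)         # (7, 7, 9)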

Python Storing Data

I have a list in my program. I have a function to append to the list; unfortunately, when you close the program, the thing you added goes away and the list goes back to the beginning. Is there any way I can store the data so the user can re-open the program and the list is still complete?
You may try the pickle module to store the in-memory data on disk. Here is an example:
store data:
import pickle
dataset = ['hello','test']
outputFile = 'test.data'
fw = open(outputFile, 'wb')
pickle.dump(dataset, fw)
fw.close()
load data:
import pickle
inputFile = 'test.data'
fd = open(inputFile, 'rb')
dataset = pickle.load(fd)
print(dataset)
You can make a database and save the data there, or write it to a file: a database with SQLite, or a .txt file. For example:
with open("mylist.txt","w") as f: #in write mode
f.write("{}".format(mylist))
Your list goes into the format() function. It'll make a .txt file named mylist and will save your list data into it.
After that, when you want to access your data again, you can do:
with open("mylist.txt") as f: #in read mode, not in write mode, careful
rd=f.readlines()
print (rd)
The built-in pickle module provides some basic functionality for serialization, which is a term for turning arbitrary objects into something suitable to be written to disk. Check out the docs for Python 2 or Python 3.
Pickle isn't very robust though, and for more complex data you'll likely want to look into a database module like the built-in sqlite3 or a full-fledged object-relational mapping (ORM) like SQLAlchemy.
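For completeness, a minimal sqlite3 sketch of that idea (the table and file names are just illustrative): each list item becomes a row, so additions survive between runs.
import sqlite3

conn = sqlite3.connect('mylist.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (value TEXT)')

def add_item(value):
    conn.execute('INSERT INTO items (value) VALUES (?)', (value,))
    conn.commit()

def load_items():
    return [row[0] for row in conn.execute('SELECT value FROM items')]

add_item('hello')
print(load_items())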
For storing big data, the HDF5 library is suitable. It is available in Python via h5py.

fast file loading on python

I have two problems with loading data in Python. Both scripts work properly, but they take too much time to run, and sometimes the result is "Killed" (with the first one).
I have a big zipped text file and I do something like this:
import gzip
import cPickle as pickle

f = gzip.open('filename.gz', 'r')
tab = {}
for line in f:
    ...  # fill tab

with open("data_dict.pkl", "wb") as g:
    pickle.dump(tab, g)

f.close()
I have to do some operations on the dictionary I created in the previous script
import cPickle as pickle

with open("data_dict.pkl", "rb") as f:
    tab = pickle.load(f)
f.close()

# operations on tab (the dictionary)
Do you have other solutions in mind? Maybe not ones involving YAML or JSON...
If the data you are pickling is primitive and simple, you can try marshal module: http://docs.python.org/3/library/marshal.html#module-marshal. That's what Python uses to serialize its bytecode, so it's pretty fast.
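A quick sketch of what that would look like for a dictionary of primitive values (the file name is made up):
import marshal

tab = {'a': [1, 2, 3], 'b': 4.5}   # primitive types only

with open('data_dict.marshal', 'wb') as f:
    marshal.dump(tab, f)

with open('data_dict.marshal', 'rb') as f:
    tab_loaded = marshal.load(f)

print(tab_loaded == tab)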
First, one comment: in
with open("data_dict.pkl", "rb") as f:
    tab = pickle.load(f)
f.close()
the f.close() is not necessary; the context manager (the with syntax) does that automatically.
Now as for speed, I don't think you're going to get much faster than cPickle for reading something from disk directly as a Python object. If this script needs to be run over and over, I would try using memcached via pylibmc to keep the object stored persistently in memory so you can access it lightning fast:
import pylibmc
mc = pylibmc.Client(["127.0.0.1"], binary=True,behaviors={"tcp_nodelay": True,"ketama": True})
d = range(10000) ## some big object
mc["some_key"] = d ## save in memory
Then, after saving it once, you can access and modify it; it stays in memory even after the previous program finishes its execution:
import pylibmc
mc = pylibmc.Client(["127.0.0.1"], binary=True,behaviors={"tcp_nodelay": True,"ketama": True})
d = mc["some_key"] ## load from memory
d[0] = 'some other value' ## modify
mc["some_key"] = d ## save to memory again
