allow_pickle=True modified my dictionary to "unsized" when loaded - Python

I am trying to save and load variables (dictionaries) to use in other notebooks. I save the variables with:
with open('opp2b.npy', 'wb') as f:
    np.save(f, mak)
    np.save(f, mp)
len(mak)
82
mak and mp are dictionaries of the same length, each with 82 entries.
When loading, it fails unless I use allow_pickle=True, so I load with:
with open('opp2b.npy', 'rb') as f:
    mak = np.load(f, allow_pickle=True)
    mp = np.load(f, allow_pickle=True)
and when I check the length I get
len(mak)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-bb967ce1f5ef> in <module>
----> 1 len(mak)
TypeError: len() of unsized object
I am not sure why the object is modified, but it is now unusable for what I need.

Per your comments, mak is not a numpy array at all. numpy.save is specifically documented to:
Save an array to a binary file in NumPy .npy format.
allow_pickle is for numpy arrays containing Python objects, but the .npy format is not intended to store things that aren't numpy arrays at all. To successfully store the dict, it's wrapping it in a 0D numpy "array", and that's what np.load is giving you. You could extract the original dict by doing:
mak = mak.item(0)  # mak[0] doesn't work: a 0D array has no axis to index with an
                   # integer, so it raises IndexError; mak[()] or mak.item() (no
                   # argument) also work here, pulling the wrapped dict back out
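Putting the pieces together, a minimal sketch (same file and variable names as above) of loading both objects back as plain dicts:
import numpy as np

with open('opp2b.npy', 'rb') as f:
    mak = np.load(f, allow_pickle=True).item()  # unwrap the 0D object array
    mp = np.load(f, allow_pickle=True).item()

len(mak)  # 82 again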
But really, that's trying to put a square peg in a round hole. If you're not storing numpy arrays, there's little benefit to the .npy format, if any. The main advantages it provides are:
- Avoiding arbitrary code execution for untrusted inputs (since you need allow_pickle, that advantage goes away)
- Allowing you to memory-map on load (irrelevant when the entire data structure must be pickled anyway; memory mapping helps only for C-level data, where you might benefit from lazy loading and better performance if RAM grows short, as the data need not be written to swap before the pages are reclaimed)
- (No longer relevant on modern Python) Storing array data more efficiently than the old pickle protocol 0 (which produced legal ASCII output, meaning only bytes of 127 or below, and so made pickling raw binary data inefficient). As long as you're using protocol 2 or higher (which is binary, handles new-style classes efficiently, and is supported back to Python 2.3), it should store your data efficiently. As of Python 3.0 the default protocol is protocol 3 (rising to protocol 4 in 3.8), so if you're using a supported version of Python and don't specify the protocol, it will use 3 or 4 (both of which work fine; protocol 4 is better if you're pickling huge objects).
Since you aren't storing numpy arrays, just rely on the pickle module directly to store arbitrary data (for modern pickle protocols, which allow efficient binary storage, numpy stores efficiently enough anyway, so the .npy format isn't helping much, if at all; for some trivial test cases I tried, saving {'a': numpy.array([0,1,2])}, the .npy dump was over twice the size).
import pickle  # At top of file

with open('opp2b.pkl', 'wb') as f:  # Name with common pickle extension instead of .npy
    pickle.dump(mak, f)  # Argument order reversed from np.save
    pickle.dump(mp, f)
and then to load:
with open('opp2b.pkl', 'rb') as f:  # Matching change in name
    mak = pickle.load(f)
    mp = pickle.load(f)
This assumes you might in fact want to load only one data set or the other at a time; if you plan to store and load both all the time, you may as well condense it to a single write of a tuple of the relevant values (increasing the chance that duplicated objects across the two objects can use back-references to avoid reserializing the same data multiple times), e.g.:
with open('opp2b.pkl', 'wb') as f:
    pickle.dump((mak, mp), f)
and:
with open('opp2b.pkl', 'rb') as f:
    mak, mp = pickle.load(f)

Related

h5py file subset taking more space than parent file?

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to do some basic min/max scaling of the numerical data within latlong, so I want to put it in its own h5py file for easier use and lower memory usage.
However, when I try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong)
out.close()
The output file is incredibly large. It's still not done writing to disk and is already ~85 GB. Why is the data being written to the new file not compressed?
Could be that h5f['data/lat_long'] is using compression filters (and you aren't). To check the original dataset's compression settings, use this line:
print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the dataset into an array, which is not necessary in most use cases. An h5py dataset object behaves much like a NumPy array. Instead, use ds = h5f1['data/lat_long'] to create a dataset object (instead of an array) and use it "like" it's a NumPy array. FYI, .value is a deprecated method to return the dataset as an array; use arr = h5f1['data/lat_long'][()] instead. Loading the dataset into an array also requires more memory than using an h5py dataset object (which could be an issue with large datasets).
There are other ways to access the data. My suggestion to use dataset objects is one way. Your method (extracting the data to a new file) is another way. I am not fond of that approach because you then have two copies of the data; a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18 GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in my post "How can I combine multiple .h5 files?", Method 1: Create External Links.
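A hedged sketch of the external-link idea (the link file name here is made up; the dataset path follows the question):
import h5py

# a small file whose 'latlong' entry points into the existing 18 GB file
with h5py.File('latlong_link.h5', 'w') as small:
    small['latlong'] = h5py.ExternalLink('input.h5', '/data/lat_long')

# use it like a local dataset; the bytes stay in input.h5
with h5py.File('latlong_link.h5', 'r') as small:
    ds = small['latlong']
    print(ds.shape, ds.dtype)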
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array, then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method; it will retain the compression settings and attributes.
See below:
with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')
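If you do end up writing a fresh dataset yourself instead of copying, you can also request compression explicitly; a sketch, with gzip level 4 chosen arbitrarily:
import h5py

with h5py.File('input.h5', 'r') as h5f, \
     h5py.File('latlong_compressed.h5', 'w') as out:
    latlong = h5f['data/lat_long'][()]          # reads the dataset into memory
    out.create_dataset('latlong', data=latlong,
                       compression='gzip', compression_opts=4)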

Python hangs silently on large file write

I am trying to write a big list of numpy ndarrays to disk.
The list is ~50000 elements long.
Each element is an ndarray of shape (~2048, 2) containing ints. The arrays have different shapes.
The method I am (currently) using is
@staticmethod
def _write_with_yaml(path, obj):
    with io.open(path, 'w+', encoding='utf8') as outfile:
        yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)
I have also tried pickle, which gives the same problem:
On small lists (~3400 long), this works fine, finishes fast enough (<30 sec).
On ~6000 long lists, this finishes after ~2 minutes.
When the list gets larger, the process seems not to do anything. No change in RAM or disk activity.
I stopped waiting after 30 minutes.
After force stopping the process, the file suddenly became of significant size (~600MB).
I can't know if it finished writing or not.
What is the correct way to write such large lists, to know whether the write succeeded, and, if possible, to know when the write/read is going to finish?
How can I debug what's happening when the process seems to hang?
I prefer not to break and assemble the lists manually in my code, I expect the serialization libraries to be able to do that for me.
For the code
import numpy as np
import yaml

x = []
for i in range(0, 50000):
    x.append(np.random.rand(2048, 2))
print("Arrays generated")
with open("t.yaml", 'w+', encoding='utf8') as outfile:
    yaml.dump(x, outfile, default_flow_style=False, allow_unicode=True)
On my system (macOS, i7, 16 GiB RAM, SSD) with Python 3.7 and PyYAML 3.13, the finish time is 61 minutes. During the save the Python process occupied around 5 GB of memory, and the final file size is 2 GB. This also shows the overhead of the file format: since the size of the data is 50k * 2048 * 2 * 8 bytes (a float is generally 64 bits in Python) = roughly 1562 MB, YAML is around 1.3 times worse (and serialisation/deserialisation also takes time).
To answer your questions:
1. There is no correct or incorrect way. Having a progress update and an estimate of the finishing time is not easy (other tasks might interfere with the estimate, resources like memory could be used up, etc.). You can rely on a library that supports that, or implement something yourself (as the other answer suggested).
2. Not sure "debug" is the correct term, as in practice the process might just be slow. Doing a performance analysis is not easy, especially when using multiple/different libraries. What I would start with is clear requirements: what do you want from the saved file? Does it need to be YAML? Saving 50k arrays as YAML does not seem the best solution if you care about performance. Ask yourself first "which is the best format for what I want?" (but you did not give details, so I can't say...).
Edit: if you want something just fast, use pickle. The code:
import numpy as np
import yaml
import pickle

x = []
for i in range(0, 50000):
    x.append(np.random.rand(2048, 2))
print("Arrays generated")
pickle.dump(x, open("t.yaml", "wb"))
finishes in 9 seconds and generates a file of 1.5 GB (no overhead). Of course, the pickle format should be used in very different circumstances than YAML...
I can't say for certain this is the answer, but it may be.
When I was working on an app that required fast cycles, I found that something in the code was very slow: opening/closing YAML files.
It was solved by using JSON.
Don't use YAML for anything other than configuration you don't open often.
Solution to your array saving:
np.save(path, array)  # path = path + name + '.npy'
If you really need to save a list of arrays, I recommend you save a list of array paths (the arrays themselves you save to disk with np.save). Saving Python objects to disk is not really what you want; what you want is to save numpy arrays with np.save.
Complete solution (saving example):
for array_index in range(len(list_of_arrays)):
    np.save(str(array_index) + '.npy', list_of_arrays[array_index])
    # path = str(array_index) + '.npy'
Complete solution (loading example):
list_of_array_paths = ['1.npy', '2.npy']
list_of_arrays = []
for array_path in list_of_array_paths:
    list_of_arrays.append(np.load(array_path))
Further advice:
Python can't really handle many large arrays at once, especially if you have loaded several of them into a list. For speed and memory, always work with one or two arrays at a time; the rest should wait on disk. So instead of an object reference, keep the reference as a path and, when needed, load the array from disk.
Also, you said you don't want to assemble the list manually.
A possible solution, which I don't advise, but which is possibly exactly what you are looking for:
>>> a = np.zeros(shape = [10,5,3])
>>> b = np.zeros(shape = [7,7,9])
>>> c = [a,b]
>>> np.save('data.npy',c)
>>> d = np.load('data.npy')
>>> d.shape
(2,)
>>> type(d)
<type 'numpy.ndarray'>
>>> d.shape
(2,)
>>> d[0].shape
(10, 5, 3)
>>>
I believe I don't need to comment on the above code. However, after loading it back, you will lose the list, as it will have been transformed into a numpy array.
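One caveat worth adding for current NumPy versions: because the arrays have different shapes, saving the list produces an object array that NumPy pickles under the hood, so np.load now needs allow_pickle=True, and you can turn the result back into a plain list explicitly:
d = np.load('data.npy', allow_pickle=True)  # object array holding the two arrays
restored = list(d)                          # back to a plain Python list of arrays
restored[0].shape                           # (10, 5, 3)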

How to split large data and rejoin later

My code generates a list of numpy arrays of size (1, 1, n, n, m, m), where n may vary from 50-100 and m from 5-10 depending on the case at hand. The length of the list itself may go up to 10,000, and the list is written/dumped using pickle at the end of the code. For cases at the higher end of these numbers, or when file sizes go beyond 5-6 GB, I get an Out of Memory error. Below is a made-up example of the situation:
import numpy as np
import pickle

list, list_length = [], 1000
n = 100
m = 3
for i in range(0, list_length):
    list.append(np.random.random((1, 1, n, n, m, m)))

file_path = 'C:/Users/Desktop/Temp/data.pkl'  # needs a file name, not just a folder
with open(file_path, 'wb') as file:
    pickle.dump(list, file)
I am looking for a way that helps me to:
1. split the data so that I can get rid of the memory error, and
2. rejoin the data in its original form when needed later.
All I could think of is:
for i in range(0, list_length):
    data = np.random.random((1, 1, n, n, m, m))
    file_path = 'C:/Users/Desktop/Temp/single' + str(i)
    with open(file_path, 'wb') as file:
        pickle.dump(data, file)
and then combine using:
combined_list = []
for i in range(0, list_length):
    file_path = 'C:/Users/Desktop/Temp/single' + str(i)
    with open(file_path, 'rb') as file:
        data = pickle.load(file)
    combined_list.append(data)
Done this way, each file is certainly smaller, but it also increases processing time due to the many file I/O operations.
Is there a more elegant and better way to do this?
Using savez, savez_compressed, or even things like h5py can be useful, as @tel mentioned, but it takes extra effort to "reinvent" a caching mechanism. There are two easier ways to process larger-than-memory ndarrays, if applicable:
The easiest way is of course to enable the pagefile on Windows or swap on Linux (not sure about the macOS counterpart). This creates virtual memory large enough that you don't need to worry about running out; the OS will save to and load from disk accordingly.
If the first way is not applicable because you don't have admin rights, etc., numpy provides another way: np.memmap. This function maps an ndarray to disk such that you can index it just as if it were in memory. Technically the I/O goes directly to the hard disk, but the OS will cache accordingly.
For the second way, you can create a disk-backed ndarray using:
np.memmap('yourFileName', dtype='float32', mode='w+', offset=0, shape=2**32)
This creates a 16 GB float32 array (containing 4G numbers) in no time. You can then do I/O to it. A lot of numpy functions have an out parameter; set it to the memmapped array so that the result is written straight to the disk-backed array instead of being built in memory and then copied out.
If you want to save a list of ndarrays using the second method, either create a lot of memmaps, or concatenate them into a single array.
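A minimal sketch of the out= idea (the file name and sizes are made up for illustration):
import numpy as np

# disk-backed result array: 1,000,000 float32 values (~4 MB on disk)
result = np.memmap('scratch.dat', dtype='float32', mode='w+', shape=(1_000_000,))

a = np.random.rand(1_000_000).astype('float32')
np.multiply(a, 2.0, out=result)   # the ufunc writes directly into the memmap
result.flush()                    # push dirty pages out to disk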
Don't use pickle to store large data; it's not an efficient way to serialize anything. Instead, use numpy's built-in serialization formats/functions, numpy.savez_compressed and numpy.load.
System memory isn't infinite, so at some point you'll still need to split your files (or use a heavier-duty solution such as the one provided by the h5py package). However, if you were able to fit the original list into memory, then savez_compressed and load should do what you need.
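A minimal sketch under the question's made-up sizes (the arr_0, arr_1, ... names are what savez_compressed assigns to positional arguments):
import numpy as np

n, m, list_length = 100, 3, 1000
arrays = [np.random.random((1, 1, n, n, m, m)) for _ in range(list_length)]

# one compressed .npz file instead of thousands of pickles
np.savez_compressed('data.npz', *arrays)

# NpzFile is lazy: each entry is decompressed only when you index it
with np.load('data.npz') as data:
    combined_list = [data['arr_%d' % i] for i in range(list_length)]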

Save Numpy Array using Pickle

I've got a Numpy array (130,000 x 3) that I would like to save using Pickle, with the following code. However, I keep getting the error "EOFError: Ran out of input" or "UnsupportedOperation: read" at the pkl.load line. This is my first time using Pickle; any ideas?
Thanks,
Anant
import pickle as pkl
import numpy as np

arrayInput = np.zeros((1000,2)) #Trial input
save = True
load = True
filename = path + 'CNN_Input'
fileObject = open(fileName, 'wb')

if save:
    pkl.dump(arrayInput, fileObject)
    fileObject.close()

if load:
    fileObject2 = open(fileName, 'wb')
    modelInput = pkl.load(fileObject2)
    fileObject2.close()

if arrayInput == modelInput:
    Print(True)
You should use numpy.save and numpy.load.
I have no problems using pickle:
In [126]: arr = np.zeros((1000,2))
In [127]: with open('test.pkl','wb') as f:
...: pickle.dump(arr, f)
...:
In [128]: with open('test.pkl','rb') as f:
...: x = pickle.load(f)
...: print(x.shape)
...:
...:
(1000, 2)
pickle and np.save/load have a deep reciprocity. For example, I can load this pickle with np.load:
In [129]: np.load('test.pkl').shape
Out[129]: (1000, 2)
If I open the pickle file in the wrong mode, I do get your error:
In [130]: with open('test.pkl','wb') as f:
...: x = pickle.load(f)
...: print(x.shape)
...:
UnsupportedOperation: read
But that shouldn't be surprising - you can't read a freshly opened write file. It will be empty.
np.save/load is the usual pair for writing numpy arrays. But pickle uses np.save to serialize arrays, and np.save uses pickle to serialize non-array objects (inside the array). The resulting file sizes are similar. Curiously, in my timings the pickle version is faster.
It's been a while, but if you're finding this: pickle completes in a fraction of the time.
with open('filename','wb') as f: pickle.dump(arrayname, f)
with open('filename','rb') as f: arrayname1 = pickle.load(f)
numpy.array_equal(arrayname,arrayname1) #sanity check
On the other hand, numpy's compressed format took my 5.2 GB down to 0.4 GB, while pickle went to 1.7 GB.
Don't use pickle for numpy arrays. For an extended discussion that links to all the resources I could find, see my answer here.
Short reasons:
- there is already a nice interface the developers of numpy made, and it will save you lots of debugging time (the most important reason)
- np.save, np.load, and np.savez have pretty good performance in most metrics (see this), which is to be expected since it's an established library and the developers of numpy made those functions
- pickle executes arbitrary code and is a security issue
- to use pickle you have to open a file yourself, which can lead to bugs (e.g. I wasn't aware of needing the 'b' flag and it stopped working; that took time to debug)
if you refuse to accept this advice, at least really articulate the reason you need to use something else. Make sure it's crystal clear in your head.
Avoid repeating code at all costs if a solution already exists!
Anyway, here are all the interfaces I tried, hopefully it saves someone time (probably my future self):
import numpy as np
import pickle
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
    pickle.dump(obj={'x':x, 'y':y}, file=db_file)
## using loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
    db_pkl = pickle.load(db_file)
print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')
But most usefully, see my answer here.
Here is one more possible way. Sometimes you should add the extra protocol option. For example,
import pickle
import numpy as np

# Your array
arrayInput = np.zeros((1000,2))
Here is your approach:
pickle.dump(arrayInput, open('file_name.pickle', 'wb'))
Which you can change to:
# in two lines of code
with open("file_name.pickle", "wb") as f:
    pickle.dump(arrayInput, f, protocol=pickle.HIGHEST_PROTOCOL)
or
# Save in one line of code
pickle.dump(arrayInput, open("file_name.pickle", "wb"), protocol=pickle.HIGHEST_PROTOCOL)
Afterwards, you can easily read your numpy array back like:
arrayInput = pickle.load(open("file_name.pickle", 'rb'))
Hope it is useful for you.
Many are forgetting one very important thing: security.
Unpickling can execute arbitrary code as soon as you call pickle.load. If loading from an untrusted source, the file could contain instructions that run on your machine, enabling things like man-in-the-middle attacks over a network, among other things (e.g. see this realpython.com article).
Pure pickled data may be faster to save/load if you don't follow it with bz2 compression (at the cost of a larger file size), but numpy save/load may be more secure.
Alternatively, you may save purely pickled data along with a keyed hash computed using the builtin hashlib and hmac libraries and, prior to loading, compare that hash against the one you stored:
import hashlib
import hmac


def calculate_hash(
    key_,
    file_path,
    hash_=hashlib.sha256
):
    with open(file_path, "rb") as fp:
        file_hash = hmac.new(key_, fp.read(), hash_).hexdigest()
    return file_hash


def compare_hash(
    hash1,
    hash2,
):
    """
    Warning:
        Do not use `==` directly to compare hash values. Timing attacks can be used
        to learn your security key. Use ``compare_digest()``.
    """
    return hmac.compare_digest(hash1, hash2)
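A hedged usage sketch (the key and file name are made up; in practice the key would come from a secrets store, not the source code):
import pickle
import numpy as np

arr = np.zeros((1000, 2))
key = b'replace-with-a-real-secret'

with open('arr.pkl', 'wb') as fp:
    pickle.dump(arr, fp)
stored_hash = calculate_hash(key, 'arr.pkl')    # keep this value alongside the file

# later, before trusting the file:
if compare_hash(stored_hash, calculate_hash(key, 'arr.pkl')):
    with open('arr.pkl', 'rb') as fp:
        arr_back = pickle.load(fp)
else:
    raise ValueError("HMAC mismatch - refusing to unpickle")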
In a corporate setting, always be sure to confirm with your IT department. You want to be sure proper authentication, encryption, and authorization is all "set to go" when loading and saving data over servers and networks.
Pickle/CPickle
If you are confident you are using nothing but trusted sources, and speed is a bigger concern than security and file size, pickle might be the way to go. In addition, you can take a few extra security measures using cPickle (in recent Python 3 versions the C implementation has been folded into the standard pickle module as the _pickle accelerator, and the equivalent of find_global is overriding Unpickler.find_class, so always double-check which interface your version exposes):
- Use a cPickle.Unpickler instance, and set its "find_global" attribute to None to disable importing any modules (thus restricting loading to builtin types such as dict, int, list, string, etc.).
- Use a cPickle.Unpickler instance, and set its "find_global" attribute to a function that only allows importing of modules and names from a whitelist.
- Use something like the itsdangerous package to authenticate the data before unpickling it if you're loading it from an untrusted source.
Numpy
If you are only saving numpy data and no other Python data, and security is a greater priority than file size and speed, then numpy might be the way to go.
HDF5/H5PY
If your data is truly large and complex, hdf5 format via h5py is good.
JSON
And of course, this discussion wouldn't be complete without mentioning json. You may need to do extra work setting up encoding and decoding of your data, but nothing gets immediately run when you use json.load, so you can check the template/structure of the loaded data before you use it.
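A minimal sketch of that extra encoding/decoding work for a plain numeric array (the file name is arbitrary):
import json
import numpy as np

arr = np.zeros((1000, 2))

with open('cnn_input.json', 'w') as f:
    json.dump(arr.tolist(), f)           # nested lists of floats are valid JSON

with open('cnn_input.json') as f:
    arr_back = np.array(json.load(f))    # nothing is executed during load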
DISCLAIMER: I take no responsibility for end-user security with this provided information. The above information is for informational purposes only. Please use proper discretion and appropriate measures (including corporate policies, where applicable) with regard to security needs.
You should use numpy.save() for saving numpy matrices.
In your code, you're using
if load:
    fileObject2 = open(fileName, 'wb')
    modelInput = pkl.load(fileObject2)
    fileObject2.close()
The second argument to the open function is the mode. w stands for writing, r for reading. The second character, b, denotes that bytes will be read/written. A file opened for writing cannot be read from, and vice versa. Therefore, opening the file with fileObject2 = open(fileName, 'rb') will do the trick.
The easiest way to save and load a NumPy array -
# a numpy array
result.importances_mean
array([-1.43651529e-03, -2.73401297e-03, 9.26784059e-05, -7.41427247e-04,
3.56811863e-03, 2.78035218e-03, 3.70713624e-03, 5.51436515e-03,
1.16821131e-01, 9.26784059e-05, 9.26784059e-04, -1.80722892e-03,
-1.71455051e-03, -1.29749768e-03, -9.26784059e-05, -1.43651529e-03,
0.00000000e+00, -1.11214087e-03, -4.63392030e-05, -4.63392030e-04,
1.20481928e-03, 5.42168675e-03, -5.56070436e-04, 8.34105653e-04,
-1.85356812e-04, 0.00000000e+00, -9.73123262e-04, -1.43651529e-03,
-1.76088971e-03])
# save the array format - np.save(filename.npy, array)
np.save(os.path.join(model_path, "permutation_imp.npy"), result.importances_mean)
# load the array format - np.load(filename.npy)
res = np.load(os.path.join(model_path, "permutation_imp.npy"))
res
array([-1.43651529e-03, -2.73401297e-03, 9.26784059e-05, -7.41427247e-04,
3.56811863e-03, 2.78035218e-03, 3.70713624e-03, 5.51436515e-03,
1.16821131e-01, 9.26784059e-05, 9.26784059e-04, -1.80722892e-03,
-1.71455051e-03, -1.29749768e-03, -9.26784059e-05, -1.43651529e-03,
0.00000000e+00, -1.11214087e-03, -4.63392030e-05, -4.63392030e-04,
1.20481928e-03, 5.42168675e-03, -5.56070436e-04, 8.34105653e-04,
-1.85356812e-04, 0.00000000e+00, -9.73123262e-04, -1.43651529e-03,
-1.76088971e-03])

Performance issue with reading integers from a binary file at specific locations

I have a file with integers stored as binary, and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code, but it's terribly slow compared to the F# version I created before.
import os, struct

def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            f.seek(index * 4, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            values.append(v)
    return values
For comparison here is the F# version:
open System
open System.IO

let readValue (reader:BinaryReader) cellIndex =
    // set stream to correct location
    reader.BaseStream.Position <- cellIndex*4L
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)

let readValues fileName indices =
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
    let values = List.map (readValue reader) (List.ofSeq indices)
    values
Any tips on how to improve the performance of the Python version, e.g. by using numpy?
Update
HDF5 works very well (from 5 seconds to 0.8 seconds on my test file):
import tables

def read_values_hdf5(filename, indices):
    with tables.open_file(filename) as f:
        dset = f.root.raster
        return dset[indices]
Update 2
I went with the np.memmap because the performance is similar to hdf5 and I already have numpy in production.
Depending heavily on your index file size, you might want to read it completely into a numpy array. If the file is not large, a complete sequential read may be faster than a large number of seeks.
One problem with the seek operations is that Python operates on buffered input. If the program were written in some lower-level language, using unbuffered I/O would be a good idea, as you only need a few values.
import numpy as np
# read the complete index into memory
index_array = np.fromfile("my_index", dtype=np.uint32)
# look up the indices you need (indices being a list of indices)
return index_array[indices]
If you would be reading almost all pages anyway (i.e. your indices are random and occur at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file and you only want to pick a few indices, it is not so fast.
Then one more possibility - which might be the fastest - is to use the python mmap module. Then the file is memory-mapped, and only the pages really required are accessed.
It should be something like this:
import mmap
import struct

with open("my_index", "rb") as f:
    memory_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for i in indices:
        # the index at position i:
        idx_value = struct.unpack('I', memory_map[4*i:4*i+4])[0]
(Note, I did not actually test that one, so there may be typing errors. Also, I did not care about endianess, so please check it is correct.)
Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:
import numpy as np

index_arr = np.memmap(filename, dtype='uint32', mode='r')
return index_arr[indices]
I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.
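For instance, a rough way to compare the two approaches on your own file (read_values is the function from the question; the indices here are placeholders):
import time
import numpy as np

indices = [10, 1000, 123456]                    # use your real, sorted indices
t0 = time.perf_counter()
seek_result = read_values("my_index", indices)  # seek/struct.unpack version
t1 = time.perf_counter()
memmap_result = np.memmap("my_index", dtype='uint32', mode='r')[indices]
t2 = time.perf_counter()
print("seek/unpack: %.3fs  memmap: %.3fs" % (t1 - t0, t2 - t1))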
EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.
What is mmap?
Memory mapped files are not something uniquely pythonic, because memory mapping is something defined in the POSIX standard. Memory mapping is a way to use devices or files as if they were just areas in memory.
File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.
The delay in random file access is mostly due to the physical rotation of the disks (SSDs are another story). On average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms, plus any data handling delay. The overhead introduced by using Python instead of a compiled language is negligible compared to this delay.
If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need and remain in the cache for further use. (This could in principle happen with fseek, as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is anyway some overhead as the call wanders through the operating system.)
mmap can also be used to write files. It is very flexible in the sense that a single memory-mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used for inter-process communication. In that case, usually no file is specified for mmap; instead, the memory map is created with no file behind it.
mmap is not very well-known despite its usefulness and relative ease of use. It has, however, one important 'gotcha'. The file size has to remain constant. If it changes during mmap, odd things may happen.
Is the indices list sorted? I think you could get better performance if the list were sorted, as you would make far fewer disk seeks.
