Chain datasets from multiple HDF5 files/datasets

The benefits and simple mapping that h5py provides (through HDF5) for persisting datasets on disk are exceptional. I run some analysis on a set of files and store the result into a dataset, one for each file. At the end of this step, I have a set of h5py.Dataset objects which contain 2D arrays. The arrays all have the same number of columns, but a different number of rows, i.e., (A,N), (B,N), (C,N), etc.
I would now like to access these multiple 2D arrays as a single 2D array. That is, I would like to read them on demand as an array of shape (A+B+C, N).
For this purpose, the h5py.Link classes do not help, as they work at the level of HDF5 nodes.
Here is some pseudocode:
import numpy as np
import h5py
a = h5py.Dataset('a',data=np.random.random((100, 50)))
b = h5py.Dataset('b',data=np.random.random((300, 50)))
c = h5py.Dataset('c',data=np.random.random((253, 50)))
# I want to view these arrays as a single array
combined = magic_array_linker([a, b, c], axis=0)
assert combined.shape == (100+300+253, 50)
For my purposes, suggestions of copying the arrays into a new file do not work. I'm also open to solving this on the numpy level, but I don't find any suitable options with numpy.view or numpy.concatenate that would work without copying out the data.
Does anybody know of a way to view multiple arrays as a stacked set of arrays, without copying and from h5py.Dataset?

First up, I don't think there is a way to do this without copying the data in order to return a single array. As far as I can tell, it's not possible to concatenate numpy views into one array - unless, of course, you create your own wrapper.
Here I demonstrate a proof of concept using Object/Region references. The basic premise is that we make a new dataset in the file which is an array of references to the constituent subarrays. By storing references like this, the subarrays can change size dynamically and indexing the wrapper will always index the correct subarrays.
As this is just a proof of concept, I haven't implemented proper slicing, just very simple indexing. There's also no attempt at error checking - this will almost definitely break in production.
class MagicArray(object):
    """Magically index an array of references
    """
    def __init__(self, file, references, axis=0):
        self.file = file
        self.references = references
        self.axis = axis

    def __getitem__(self, items):
        # We need to modify the indices, so make sure items is a list
        items = list(items)

        for item in items:
            if hasattr(item, 'start'):
                # item is a slice object
                raise ValueError('Slices not implemented')

        for ref in self.references:
            size = self.file[ref].shape[self.axis]

            # Check if the requested index is in this subarray
            # If not, subtract the subarray size and move on
            if items[self.axis] < size:
                item_ref = ref
                break
            else:
                items[self.axis] = items[self.axis] - size

        return self.file[item_ref][tuple(items)]
Here's how you use it:
with h5py.File("/tmp/so_hdf5/test.h5", 'w') as f:
    a = f.create_dataset('a', data=np.random.random((100, 50)))
    b = f.create_dataset('b', data=np.random.random((300, 50)))
    c = f.create_dataset('c', data=np.random.random((253, 50)))

    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    ref_dataset = f.create_dataset("refs", (3,), dtype=ref_dtype)

    for i, key in enumerate([a, b, c]):
        ref_dataset[i] = key.ref
with h5py.File("/tmp/so_hdf5/test.h5", 'r') as f:
    foo = MagicArray(f, f['refs'], axis=0)
    print(foo[104, 4])
    print(f['b'][4, 4])
This should be fairly trivial to extend to fancier indexing (i.e. being able to handle slices), but I can't see how to do so without copying data.
You might be able to subclass from numpy.ndarray and get all the usual methods as well.
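For completeness, here is a rough sketch of how a slice along the stacking axis could be serviced (assuming a plain slice with non-negative start and stop); note that assembling the result necessarily copies the selected rows out of the separate datasets:
import numpy as np

def read_slice(file, references, sl, axis=0):
    """Read rows sl.start:sl.stop of the virtual concatenation of the
    referenced datasets along `axis` (step and negative indices ignored)."""
    pieces = []
    offset = 0
    for ref in references:
        dset = file[ref]
        size = dset.shape[axis]
        # translate the global slice into this subarray's local coordinates
        lo = max(sl.start - offset, 0)
        hi = min(sl.stop - offset, size)
        if lo < hi:
            index = [slice(None)] * len(dset.shape)
            index[axis] = slice(lo, hi)
            pieces.append(dset[tuple(index)])
        offset += size
    return np.concatenate(pieces, axis=axis)
For example, read_slice(f, f['refs'], slice(95, 110)) would pull the last 5 rows of 'a' and the first 10 rows of 'b' and return a (15, 50) array.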

Related

Write a Scalar to CSV (Numpy)

I'm generating a number of test files iteratively; the process derives a 0-, 1- or 2-dimensional numpy array and then writes that array to CSV, at least that's the intent.
Does anyone have a good solution for this?
My code (expectedly) fails when the output is zero-dimensional (scalar):
for key in testfiles:
    tname = key + ".csv"
    np.savetxt(tname, testfiles[key], delimiter=",", newline=';', fmt='%0.15f')
There are a couple of ways to ensure that your input is not a scalar in numpy. For example, you could use np.array:
arr = np.array(testfiles[key], ndmin=1, copy=False)
Another option is np.atleast_1d:
arr = np.atleast_1d(testfiles[key])
Both options will attempt to make an object without copying the data. In both cases, pass arr to np.savetxt instead of testfiles[key].
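Putting that together with the loop from the question (the testfiles contents below are just illustrative):
import numpy as np

testfiles = {"scalar_case": np.float64(3.14), "vector_case": np.array([1.0, 2.0])}

for key in testfiles:
    tname = key + ".csv"
    arr = np.atleast_1d(testfiles[key])  # promotes a 0-d scalar to shape (1,)
    np.savetxt(tname, arr, delimiter=",", newline=';', fmt='%0.15f')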

Python/Numpy: Build 2D array without adding duplicate rows (for triangular mesh)

I'm working on some code that manipulates 3D triangular meshes. Once I have imported mesh data, I need to "unify" vertices that are at the same point in space.
I've been assuming that numpy arrays would be the fastest way of storing & manipulating the data, but I can't seem to find a fast way of building a list of vertices while avoiding adding duplicate entries.
So, to test out methods, create a 30000x3 array with 10000 unique rows:
import numpy as np
points = np.random.random((10000,3))
raw_data = np.concatenate((points,points,points))
np.random.shuffle(raw_data)
This serves as a good approximation of mesh data, with each point appearing as a facet vertex 3 times. While unifying, I need to build a list of unique vertices; if a point already is in the list a reference to it must be stored.
The best I've been able to come up with using numpy so far has been the following:
def unify(raw_data):
    # first point must be new
    unified_verts = np.zeros((1, 3), dtype=np.float64)
    unified_verts[0] = raw_data[0]
    ref_list = [0]

    for i in range(1, len(raw_data)):
        point = raw_data[i]
        index_array = np.where(np.all(point == unified_verts, axis=1))[0]

        # point not in array yet
        if len(index_array) == 0:
            point = np.expand_dims(point, 0)
            unified_verts = np.concatenate((unified_verts, point))
            ref_list.append(len(unified_verts) - 1)

        # point already exists
        else:
            ref_list.append(index_array[0])

    return unified_verts, ref_list
Testing using cProfile:
import cProfile
cProfile.run("unify(raw_data)")
On my machine this runs in 5.275 seconds. I've thought about using Cython to speed it up, but from what I've read, Cython doesn't typically run much faster than numpy methods. Any advice on ways to do this more efficiently?
Jaime has shown a neat trick which can be used to view a 2D array as a 1D array with items that correspond to rows of the 2D array. This trick can allow you to apply numpy functions which take 1D arrays as input (such as np.unique) to higher dimensional arrays.
If the order of the rows in unified_verts does not matter (as long as the ref_list is correct with respect to unified_verts), then you could use np.unique along with Jaime's trick like this:
def unify2(raw_data):
    dtype = np.dtype((np.void, raw_data.shape[1] * raw_data.dtype.itemsize))
    uniq, inv = np.unique(raw_data.view(dtype), return_inverse=True)
    uniq = uniq.view(raw_data.dtype).reshape(-1, raw_data.shape[1])
    return uniq, inv
The result is the same in the sense that the raw_data can be reconstructed from the return values of unify (or unify2):
unified, ref = unify(raw_data)
uniq, inv = unify2(raw_data)
assert np.allclose(uniq[inv], unified[ref]) # raw_data
On my machine, unified, ref = unify(raw_data) requires about 51.390s, while uniq, inv = unify2(raw_data) requires about 0.133s (~ 386x speedup).

How do I fill two (or more) numpy arrays from a single iterable of tuples?

The actual problem I have is that I want to store a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit in my 4Gb RAM, so I thought I could use two numpy.ndarrays.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how can I use it? The number of items in the iterable is unknown. I can't consume it to a list first due to memory limitations. I thought of itertools.tee, but it seems to add a lot of memory overhead here.
What I guess I could do is consume the iterator in chunks and add those to the arrays. Then my question is, how to do that efficiently? Should I maybe make 2 2D arrays and add rows to them? (Then later I'd need to convert them to 1D).
Or maybe there's a better approach? Everything I really need is to search through an array of strings by the value of the corresponding number in logarithmic time (that's why I want to sort by the value of float) and to keep it as compact as possible.
P.S. The iterable is not sorted.
Perhaps build a single, structured array using np.fromiter:
import numpy as np

def gendata():
    # You, of course, have a different gendata...
    for i in xrange(N):
        yield (np.random.random(), str(i))

N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20')
Sorting it by the first column, using the second for tie-breakers, will take O(N log N) time:
arr.sort(order=['f0','f1'])
Finding the row by the value in the first column can be done with searchsorted in O(log N) time:
# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')
idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')
You've asked many important questions in the comments; let me attempt to answer them here:
The basic dtypes are explained in the numpybook. There may be one or two extra dtypes (like float16) which have been added since that book was written, but the basics are all explained there. Perhaps a more thorough discussion is in the online documentation, which is a good supplement to the examples you mentioned here.
Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are the default column names. Since I defined the dtype as '<f8,|S20' without providing column names, NumPy named the first column 'f0' and the second 'f1'. If we had used dtype=[('fval', '<f8'), ('text', '|S20')], then the structured array arr would have column names 'fval' and 'text'.
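For example (a small sketch reusing the gendata generator from above), the named version would look like this:
named = np.fromiter(gendata(), dtype=[('fval', '<f8'), ('text', '|S20')])
named.sort(order=['fval', 'text'])
print(named['fval'][:3])  # the float column, now addressable by name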
Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You could conceivably iterate through gendata once to discover the maximum length of the strings, build your dtype and then call np.fromiter (and iterate through gendata a second time), but that's rather burdensome. It is of course better if you know in advance the maximum size of the strings. (|S20 defines the string field as having a fixed length of 20 bytes.)
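A sketch of that two-pass idea, assuming gendata() is cheap to re-create:
max_len = max(len(text) for _, text in gendata())      # first pass: find the longest string
dtype = [('fval', '<f8'), ('text', '|S%d' % max_len)]
arr = np.fromiter(gendata(), dtype=dtype)              # second pass: build the array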
NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of the array (even multidimensional ones) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or somehow be redesigned. NumPy is simply not built this way.
NumPy does have an object dtype which allows you to place a pointer to any Python object you desire. This way, you can have NumPy arrays with arbitrary Python data. Unfortunately, the np.fromiter function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
Note that np.fromiter has better performance when the count is specified. By knowing the count (the number of rows) and the dtype (and thus the size of each row), NumPy can pre-allocate exactly enough memory for the resultant array. If you do not specify the count, then NumPy will make a guess for the initial size of the array, and if too small, it will try to resize the array. If the original block of memory can be extended you are in luck. But if NumPy has to allocate an entirely new hunk of memory then all the old data will have to be copied to the new location, which will slow down the performance significantly.
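For instance, when the number of rows is known (here the N = 100 of the toy gendata), passing count looks like this:
arr = np.fromiter(gendata(), dtype='<f8,|S20', count=N)  # pre-allocates exactly N rows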
Here is a way to build N separate arrays out of a generator of N-tuples:
import numpy as np
import itertools as IT

def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in xrange(N):
        yield (np.random.random(), str(i))

def fromiter(iterable, dtype, chunksize=7):
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)

    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            arr.resize(newsize, refcheck=0)
            arr[size:] = col
        size = newsize
    return result
x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]
# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')
idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize parameter with a larger value.
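For real data the call might look like this (the chunk size here is arbitrary):
x, y = fromiter(gendata(), '<f8,|S20', chunksize=10000)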

numpy: efficient execution of a complex reshape of an array

I am reading a vendor-provided large binary array into a 2D numpy array tempfid(M, N)
# load data
data=numpy.fromfile(file=dirname+'/fid', dtype=numpy.dtype('i4'))
# convert to complex data
fid=data[::2]+1j*data[1::2]
tempfid=fid.reshape(I*J*K, N)
and then I need to reshape it into a 4D array useful4d(N,I,J,K) using non-trivial mappings for the indices. I do this with a for loop along the following lines:
for idx in range(M):
    i = f1(idx)  # f1, f2, and f3 are functions involving / and % as well as some lookups
    j = f2(idx)
    k = f3(idx)
    newfid[:, i, j, k] = tempfid[idx, :]  # SLOW! CAN WE IMPROVE THIS?
Converting to complex takes 33% of the time, while copying these M slices takes the remaining 66%. Calculating the indices is fast irrespective of whether I do it one by one in a loop as shown or by numpy.vectorizing the operation and applying it to an arange(M).
Is there a way to speed this up? Any help on more efficient slicing, copying (or not) etc appreciated.
EDIT:
As learned in the answer to the question "What's the fastest way to convert an interleaved NumPy integer array to complex64?", the conversion to complex can be sped up by a factor of 6 if a view is used instead:
fid = data.astype(numpy.float32).view(numpy.complex64)
idx = numpy.arange(M)
i = numpy.vectorize(f1)(idx)
j = numpy.vectorize(f2)(idx)
k = numpy.vectorize(f3)(idx)
# you can index arrays with other arrays
# that lets you specify this operation in one line.
newfid[:, i,j,k] = tempfid.T
I've never used numpy's vectorize. Vectorize just means that numpy will call your python function multiple times. In order to get speed, you need to use array operations like the one shown here and the one you used to get the complex numbers.
EDIT
The problem is that the dimension of size 128 was first in newfid, but last in tempfid. This is easily fixed by using .T, which takes the transpose.
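As a sanity check, here is a tiny self-contained version (the shapes and the index functions f1, f2, f3 below are made up for illustration) showing that the one-line fancy-indexing assignment matches the original loop:
import numpy as np

M, N, I, J, K = 24, 5, 2, 3, 4           # M == I*J*K
tempfid = np.random.random((M, N))

def f1(idx): return idx // (J * K)       # hypothetical index mappings
def f2(idx): return (idx // K) % J
def f3(idx): return idx % K

# loop version
newfid_loop = np.empty((N, I, J, K))
for idx in range(M):
    newfid_loop[:, f1(idx), f2(idx), f3(idx)] = tempfid[idx, :]

# vectorized version: index with arrays and assign the transpose in one shot
idx = np.arange(M)
newfid_vec = np.empty((N, I, J, K))
newfid_vec[:, f1(idx), f2(idx), f3(idx)] = tempfid.T

assert np.allclose(newfid_loop, newfid_vec)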
How about this: set up your indices using vectorized versions of f1, f2, f3 (not necessarily using np.vectorize, but perhaps just writing a function that takes an array and returns an array), then use np.ix_:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ix_.html
to get the index arrays. Then reshape tempfid to the same shape as newfid and then use the results of np.ix_ to set the values. For example:
tempfid = np.arange(10)
i = np.array([4, 3, 2, 1, 0])   # e.g. the result of a vectorized f1(idx)
j = np.array([1, 0])            # e.g. the result of a vectorized f2(idx)
ii = np.ix_(i, j)
newfid = tempfid.reshape((5, 2))[ii]
This maps the elements of tempfid onto a new shape with a different ordering.

Best way to create a NumPy array from a dictionary?

I'm just starting with NumPy so I may be missing some core concepts...
What's the best way to create a NumPy array from a dictionary whose values are lists?
Something like this:
d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }
Should turn into something like:
data = [
    [10, 20, 30, ?, ?],
    [50, 60, ?, ?, ?],
    [100, 200, 300, 400, 500]
]
I'm going to do some basic statistics on each row, eg:
deviations = numpy.std(data, axis=1)
Questions:
What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large; a couple of million keys, each with ~20 items.
The number of values for each 'row' is different. If I understand correctly, numpy wants uniform size, so what do I fill in for the missing items to make std() happy?
Update: One thing I forgot to mention - while the python techniques are reasonable (eg. looping over a few million items is fast), it's constrained to a single CPU. Numpy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.
You don't need to create numpy arrays to call numpy.std().
You can call numpy.std() in a loop over all the values of your dictionary. The list will be converted to a numpy array on the fly to compute the standard deviation.
The downside of this method is that the main loop will be in python and not in C. But I guess this should be fast enough: you will still compute std at C speed, and you will save a lot of memory as you won't have to store 0 values where you have variable size arrays.
If you want to further optimize this, you can store your values into a list of numpy arrays, so that you do the python list -> numpy array conversion only once.
If you find that this is still too slow, try using Psyco to optimize the python loop.
If this is still too slow, try using Cython together with the numpy module. This tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with the sum function).
An alternative to Cython would be to use SWIG with numpy.i.
If you want to use only numpy and have everything computed at the C level, try grouping all the records of the same size together in different arrays and calling numpy.std() on each of them. It should look like the following example.
Example with O(N) complexity:
import numpy

list_size_1 = []
list_size_2 = []

for row in data.itervalues():
    if len(row) == 1:
        list_size_1.append(row)
    elif len(row) == 2:
        list_size_2.append(row)

list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)

std_1 = numpy.std(list_size_1, axis=1)
std_2 = numpy.std(list_size_2, axis=1)
While there are already some pretty reasonable ideas present here, I believe the following is worth mentioning.
Filling missing data with any default value would spoil the statistical characteristics (std, etc). Evidently that's why Mapad proposed the nice trick with grouping same sized records.
The problem with it (assuming there isn't any a priori data on record lengths at hand) is that it involves even more computation than the straightforward solution:
at least O(N*logN) 'len' calls and comparisons for sorting with an effective algorithm
O(N) checks on the second pass through the list to obtain the groups (their beginning and end indexes on the 'vertical' axis)
Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).
It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification: do not generate the whole list, but iterate through the dictionary, converting each row into a numpy.array and performing the required computations. Like this:
for row in data.itervalues():
    np_row = numpy.array(row)
    this_row_std = numpy.std(np_row)
    # compute any other statistic descriptors needed and then save to some list
In any case, a few million loops in python won't take as long as one might expect. Besides, this doesn't look like a routine computation, so who cares if it takes an extra second or minute if it is run once in a while or even just once.
A generalized variant of what was suggested by Mapad:
from numpy import array, mean, std

def get_statistical_descriptors(a):
    # compute each descriptor along the last axis
    ax = a.ndim - 1
    functions = [mean, std]
    return [f(a, axis=ax) for f in functions]

def process_long_list_stats(data):
    # group row keys by row length so each group forms a rectangular array
    groups = {}
    for key, row in data.iteritems():
        size = len(row)
        try:
            groups[size].append(key)
        except KeyError:
            groups[size] = [key]

    results = []
    for gr_keys in groups.itervalues():
        gr_rows = array([data[k] for k in gr_keys])
        stats = get_statistical_descriptors(gr_rows)
        results.extend(zip(gr_keys, zip(*stats)))
    return dict(results)
numpy dictionary
You can use a structured array to preserve the ability to address a numpy object by a key, like a dictionary.
import numpy as np
dd = {'a':1,'b':2,'c':3}
dtype = [(key, float) for key in dd.keys()]  # one float field per dictionary key
values = [tuple(dd.values())]
numpy_dict = np.array(values, dtype=dtype)
numpy_dict['c']
will now output
array([ 3.])
