PyTables: indexing multiple dimensions of large arrays - python

I'm analysing some imaging data that consists of large 3-dimensional arrays of pixel intensities with dimensions [frame, x, y]. Since these are usually too big to hold in memory, they reside on the hard disk as PyTables arrays.
What I'd like to be able to do is read out the intensities in an arbitrary subset of pixels across all frames. The natural way to do this seems to be list indexing:
import numpy as np
import tables
tmph5 = tables.open_file('temp.hdf5', 'w')
bigarray = tmph5.create_array('/', 'bigarray', np.random.randn(1000, 200, 100))
roipixels = [[0, 1, 2, 4, 6], [34, 35, 36, 40, 41]]
roidata = bigarray[:, roipixels[0], roipixels[1]]
# IndexError: Only one selection list is allowed
Unfortunately it seems that PyTables currently only supports a single set of list indices. A further problem is that a list index can't contain duplicates - I couldn't simultaneously read pixels [1, 2] and [1, 3], since my list of pixel x-coordinates would contain [1, 1]. I know that I can iterate over rows in the array:
roidata = np.asarray([row[roipixels[0], roipixels[1]] for row in bigarray])
but these iterative reads become quite slow for the large number of frames I'm processing.
Is there a nicer way of doing this? I'm relatively new to PyTables, so if you have any tips on organising datasets in large arrays I'd love to hear them.

For whatever it's worth, I often do the same thing with 3D seismic data stored in hdf format.
The iterative read is slow because of the nested loops. If you do a single loop over the index pairs (rather than looping over each row), it's quite fast (at least with h5py; I typically only store table-like data with pytables) and does exactly what you want.
In most cases, you'll want to iterate over your lists of indices, rather than over each row.
Basically, you want:
roidata = np.vstack([bigarray[:,i,j] for i,j in zip(*roipixels)])
Instead of:
roidata = np.asarray([row[roipixels[0],roipixels[1]] for row in bigarray])
If this is your most common use case, adjusting the chunksize of the stored array will help dramatically. You'll want long, narrow chunks, with the longest length along the first axis, in your case.
(Caveat: I haven't tested this with pytables, but it works perfectly with h5py.)
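For concreteness, here is a minimal sketch of that layout in PyTables (the chunk sizes and file name are illustrative, not tuned; in h5py the analogous knob is the chunks= keyword of create_dataset):
import numpy as np
import tables

# Sketch: chunks span all frames but only a 10x10 pixel patch, so a read of
# bigarray[:, i, j] only touches a handful of chunks.
tmph5 = tables.open_file('temp_chunked.hdf5', 'w')
bigarray = tmph5.create_carray('/', 'bigarray',
                               atom=tables.Float64Atom(),
                               shape=(1000, 200, 100),
                               chunkshape=(1000, 10, 10))
# Fill in blocks of frames to keep memory bounded; note that this write
# pattern partially updates many chunks, so write in as large blocks as you can.
for start in range(0, 1000, 100):
    bigarray[start:start + 100, :, :] = np.random.randn(100, 200, 100)

roipixels = [[0, 1, 2, 4, 6], [34, 35, 36, 40, 41]]
roidata = np.vstack([bigarray[:, i, j] for i, j in zip(*roipixels)])
tmph5.close()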

Related

Split several times a numpy array into irregular fragments

As the title of the question suggests, I am trying to find an optimal (and possibly pythonic) way of splitting a one-dimensional numpy array several times into irregular fragments, subject to the following conditions: the first split produces n fragments whose lengths l are contained in the LSHAPE array; the second split is applied to each of the n previous fragments, and now each of them is split regularly into m arrays. The corresponding values of m are stored in the MSHAPE array, in such a way that the i-th m matches the i-th l. To best illustrate my problem, I include the solution I have found so far, which makes use of the numpy split method:
import numpy as np
# Define arrays (n = 3 in this example)
LSHAPE = np.array([5, 8, 3])
MSHAPE = np.array([4, 5, 2])
# Generate a random 1D array of the required length
LM_SHAP = np.sum(np.multiply(LSHAPE, MSHAPE))
REFDAT = np.random.uniform(-1, 1, size=LM_SHAP)
# Split twice the array (this is my solution so far)
SLICE_L = np.split(REFDAT, np.cumsum(np.multiply(LSHAPE, MSHAPE)))[0:-1]
SLICE_L_M = []
for idx, mfrags in enumerate(SLICE_L):
    SLICE_L_M.append(np.split(mfrags, MSHAPE[idx]))
In the code above a random test array (REFDAT) is created to fulfil the requirements of the problem and then split. The results are stored in the SLICE_L_M list. This solution works, but I think it is hard to read and possibly inefficient, so I would like to know if it is possible to improve it. I have read some Stackoverflow threads related to this one (like this one and this one), but I think my problem is slightly different. Thanks in advance for your help and time.
Edit:
One can gain an average ~ 3% CPU time improvement if a list comprehension is used:
SLICE_L = np.split(REFDAT, np.cumsum(np.multiply(LSHAPE, MSHAPE)))[0:-1]
SLICE_L_M = [np.split(lval, mval) for lval, mval in zip(SLICE_L, MSHAPE)]
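Since each of the n fragments has length l*m and is then split into m equal pieces, the inner split can also be replaced by a reshape (a sketch, assuming a 2D sub-array per fragment is an acceptable substitute for a list of m 1D arrays):
import numpy as np

LSHAPE = np.array([5, 8, 3])
MSHAPE = np.array([4, 5, 2])
REFDAT = np.random.uniform(-1, 1, size=np.sum(LSHAPE * MSHAPE))

# First split as before, then reshape each fragment of length l*m into (m, l)
SLICE_L = np.split(REFDAT, np.cumsum(LSHAPE * MSHAPE))[:-1]
SLICE_L_M = [frag.reshape(m, -1) for frag, m in zip(SLICE_L, MSHAPE)]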

Python h5py - efficient access of arrays of ragged arrays

I have a large h5py file with several ragged arrays in a large dataset. The arrays have one of the following types:
# Create types of lists of variable length vectors
vardoub = h5py.special_dtype(vlen=np.dtype('double'))
varint = h5py.special_dtype(vlen=np.dtype('int8'))
Within an HDF5 group (grp), I create datasets of N jagged items, e.g.:
d = grp.create_dataset("predictions", (N,), dtype=vardoub)
and populate d[0], d[1], ..., d[N-1] with long numpy arrays (usually in the hundreds of millions).
Creating these arrays works well; my issue is with access. If I want to access a slice from one of the arrays, e.g. d[0][5000:6000] or d[0][[50, 89, 100]], memory usage goes through the roof, and I believe it is reading in large sections of the array: I can watch physical memory usage rise from 5-6 GB to 32 GB (the size of RAM on the machine) very quickly. p = d[0] reads the whole array into memory, so I think that is what is happening, and then it indexes into it.
Is there a better way to do this? d[n]'s type is a numpy array and I cannot take a reference to it. I suspect that I could restructure the data so that I have groups for each of the indices, e.g. '0/predictions', '1/predictions', ..., but I would prefer not to convert the data if there is a reasonable alternative.
Thank you,
Marie
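One restructuring worth considering (a sketch added for illustration, not from the thread; the dataset names and chunk size are made up) is to store everything in a single flat, chunked dataset plus an offsets index. A slice of any logical row then only reads the chunks it overlaps, instead of pulling a whole vlen row into memory:
import h5py
import numpy as np

# Write: one flat resizable dataset, plus the start offset of each logical row.
with h5py.File('predictions_flat.h5', 'w') as f:
    flat = f.create_dataset('predictions_flat', shape=(0,), maxshape=(None,),
                            dtype='float64', chunks=(1 << 18,))
    offsets = [0]
    for arr in (np.random.randn(1000000), np.random.randn(2500000)):  # stand-ins
        start = offsets[-1]
        flat.resize((start + arr.size,))
        flat[start:start + arr.size] = arr
        offsets.append(start + arr.size)
    f.create_dataset('offsets', data=np.asarray(offsets, dtype='int64'))

# Read: a slice of logical row n touches only the overlapping chunks.
with h5py.File('predictions_flat.h5', 'r') as f:
    off = f['offsets'][...]
    n = 1
    part = f['predictions_flat'][off[n] + 5000 : off[n] + 6000]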

Is there any performance reason to use ndim 1 or 2 vectors in numpy?

This seems like a pretty basic question, but I didn't find anything related to it on stack. Apologies if I missed an existing question.
I've seen some mathematical/linear algebraic reasons why one might want to use numpy vectors "proper" (i.e. ndim 1), as opposed to row/column vectors (i.e. ndim 2).
But now I'm wondering: are there any (significant) efficiency reasons why one might pick one over the other? Or is the choice pretty much arbitrary in that respect?
(edit) To clarify: By "ndim 1 vs ndim 2 vectors" I mean representing a vector that contains, say, numbers 3 and 4 as either:
np.array([3, 4]) # ndim 1
np.array([[3, 4]]) # ndim 2
The numpy documentation seems to lean towards the first case as the default, but like I said, I'm wondering if there's any performance difference.
If you use numpy properly, then no - it is not a consideration.
If you look at the numpy internals documentation, you can see that
Numpy arrays consist of two major components, the raw array data (from now on, referred to as the data buffer), and the information about the raw array data. The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. Numpy also contains a significant set of data that describes how to interpret the data in the data buffer.
So, irrespective of the dimensions of the array, all data is stored in a continuous buffer. Now consider
a = np.array([1, 2, 3, 4])
and
b = np.array([[1, 2], [3, 4]])
It is true that accessing a[1] requires (slightly) fewer operations than b[1, 1] (since translating 1, 1 to the flat index requires some calculation), but, for high performance, vectorized operations are required anyway.
If you want to sum all elements in the arrays, then in both cases you would use the same thing, a.sum() and b.sum(), and the sum would be over elements in contiguous memory anyway. Conversely, if the data is inherently 2d, then you could do things like b.sum(axis=1) to sum over rows. Doing this yourself in a 1d array would be error-prone, and no more efficient.
So, basically, a 2d array, if it is natural for the problem, just gives greater functionality, with zero or negligible overhead.
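A quick illustrative check (added here, not a rigorous benchmark): an ndim-1 array and an ndim-2 view of it share the same buffer, whole-array reductions agree, and axis-wise reductions are a bonus only the 2d shape offers.
import numpy as np

a = np.arange(1000000, dtype=np.float64)   # ndim 1
b = a.reshape(1000, 1000)                  # ndim 2 view of the same buffer, no copy
assert b.base is a                         # same underlying data
assert a.sum() == b.sum()                  # same reduction over contiguous memory
row_sums = b.sum(axis=1)                   # per-row sums, natural only for 2d data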

How to iterate over list of slices?

I couldn't find a solution to a performance problem.
I have a 1D array and I would like to compute sums over sliding windows of indices, here is an example code:
import numpy as np
input = np.linspace(1, 100, 100)
list_of_indices = [[0, 10], [5, 15], [45, 50]] #just an example
output = np.array([input[idx[0]: idx[1]].sum() for idx in list_of_indices])
The computation of the output array is extremely slow compared to numpy vectorised built-in functions.
In real life my list_of_indices contains tens of thousands [lower bound, upper bound] pairs, and this loop is definitely the bottle-neck of a high performance python script.
How can I deal with this using numpy built-in functions, like masks, clever np.einsum, or other tricks like these?
Since I work in HPC field, I am also concerned by memory consumption.
Does anyone have an answer for this problem while respecting the performance requirements?
If:
input is about the same length as output or shorter
The output values have similar magnitude
...you could create a cumsum of your input values. Then the summations turn into subtractions.
cs = np.cumsum(input, dtype=np.float32)           # or float64 if you need it
cs = np.concatenate((np.zeros(1, cs.dtype), cs))  # prepend 0 so cs[b] - cs[a] == input[a:b].sum()
loi = np.array(list_of_indices, dtype=np.uint16)
output = cs[loi[:, 1]] - cs[loi[:, 0]]
The numerical hazard here is loss of precision if input has runs of large and tiny values. Then cumsum may not be accurate enough for you.
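A quick sanity check of the cumsum trick against the explicit loop (added for illustration, not part of the original answer; data stands in for input):
import numpy as np

data = np.linspace(1, 100, 100)
list_of_indices = [[0, 10], [5, 15], [45, 50]]

cs = np.concatenate(([0.0], np.cumsum(data, dtype=np.float64)))
loi = np.asarray(list_of_indices)
fast = cs[loi[:, 1]] - cs[loi[:, 0]]

slow = np.array([data[a:b].sum() for a, b in list_of_indices])
assert np.allclose(fast, slow)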
Here's a simple approach to try: keep the solution structure you already have, which presumably works, and just make the storage creation and indexing more efficient. If you are summing many elements from input for most indices, the summation ought to take more time than the for loop itself. For example:
# Put all the indices in a nice efficient structure:
idxx = np.hstack((np.array(list_of_indices, dtype=np.uint16),
                  np.arange(len(list_of_indices), dtype=np.uint16)[:, None]))
# Allocate an appropriate data type for the precision and range you need,
# and do it in one go to be time-efficient
output = np.zeros(len(list_of_indices), dtype=np.float32)
for idx0, idx1, idxo in idxx:
    output[idxo] = input[idx0:idx1].sum()
If len(list_of_indices) > 2**16, use uint32 rather than uint16.
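Another vectorized option worth trying (a sketch, not from the answers above) is np.add.reduceat, which computes all the window sums in one call. It assumes lower < upper for every pair; the appended zero lets a window end exactly at the end of the array, and data stands in for input:
import numpy as np

data = np.linspace(1, 100, 100)
list_of_indices = [[0, 10], [5, 15], [45, 50]]

padded = np.append(data, 0.0)
bounds = np.asarray(list_of_indices).ravel()
# reduceat sums over [bounds[k], bounds[k+1]); every other entry is a window we want
output = np.add.reduceat(padded, bounds)[::2]

expected = np.array([data[a:b].sum() for a, b in list_of_indices])
assert np.allclose(output, expected)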

numpy arrays: filling and extracting data quickly

See important clarification at bottom of this question.
I am using numpy to speed up some processing of longitude/latitude coordinates. Unfortunately, my numpy "optimizations" made my code run about 5x more slowly than it ran without using numpy.
The bottleneck seems to be in filling the numpy array with my data, and then extracting out that data after I have done the mathematical transformations. To fill the array I basically have a loop like:
point_list = GetMyPoints() # returns a long list of ( lon, lat ) coordinate pairs
n = len( point_list )
point_buffer = numpy.empty( ( n, 2 ), numpy.float32 )
for point_index in xrange( 0, n ):
    point_buffer[ point_index ] = point_list[ point_index ]
That loop, just filling in the numpy array before even operating on it, is extremely slow, much slower than the entire computation was without numpy. (That is, it's not just the slowness of the python loop itself, but apparently some huge overhead in actually transferring each small block of data from python to numpy.) There is similar slowness on the other end; after I have processed the numpy arrays, I access each modified coordinate pair in a loop, again as
some_python_tuple = point_buffer[ index ]
Again that loop to pull the data out is much slower than the entire original computation without numpy. So, how do I actually fill the numpy array and extract data from the numpy array in a way that doesn't defeat the purpose of using numpy in the first place?
I am reading the data from a shape file using a C library that hands me the data as a regular python list. I understand that if the library handed me the coordinates already in a numpy array there would be no "filling" of the numpy array necessary. But unfortunately the starting point for me with the data is as a regular python list. And more to the point, in general I want to understand how you quickly fill a numpy array with data from within python.
Clarification
The loop shown above is actually oversimplified. I wrote it that way in this question because I wanted to focus on the problem I was seeing of trying to fill a numpy array slowly in a loop. I now understand that doing that is just slow.
In my actual application what I have is a shape file of coordinate points, and I have an API to retrieve the points for a given object. There are something like 200,000 objects. So I repeatedly call a function GetShapeCoords( i ) to get the coords for object i. This returns a list of lists, where each sublist is a list of lon/lat pairs, and the reason it's a list of lists is that some of the objects are multi-part (i.e., multi-polygon). Then, in my original code, as I read in each object's points, I was doing a transformation on each point by calling a regular python function, and then plotting the transformed points using PIL. The whole thing took about 20 seconds to draw all 200,000 polygons. Not terrible, but much room for improvement. I noticed that at least half of those 20 seconds were spent doing the transformation logic, so I thought I'd do that in numpy. And my original implementation was just to read in the objects one at a time, and keep appending all the points from the sublists into one big numpy array, which I then could do the math stuff on in numpy.
So, I now understand that simply passing a whole python list to numpy is the right way to set up a big array. But in my case I only read one object at a time. So one thing I could do is keep appending points together in a big python list of lists of lists. And then when I've compiled some large number of objects' points in this way (say, 10000 objects), I could simply assign that monster list to numpy.
So my question now is three parts:
(a) Is it true that numpy can take that big, irregularly shaped, list of lists of lists, and slurp it okay and quickly?
(b) I then want to be able to transform all the points in the leaves of that monster tree. What is the expression to get numpy to, for instance, "go into each sublist, and then into each subsublist, and then for each coordinate pair you find in those subsublists multiply the first (lon coordinate) by 0.5"? Can I do that?
(c) Finally, I need to get those transformed coordinates back out in order to plot them.
Winston's answer below seems to give some hint at how I might do this all using itertools. What I want to do is pretty much like what Winston does, flattening the list out. But I can't quite just flatten it out. When I go to draw the data, I need to be able to know when one polygon stops and the next starts. So, I think I could make it work if there were a way to quickly mark the end of each polygon (i.e., each subsublist) with a special coordinate pair like (-1000, -1000) or something like that. Then I could flatten with itertools as in Winston's answer, and then do the transforms in numpy. Then I need to actually draw from point to point using PIL, and here I think I'd need to reassign the modified numpy array back to a python list, and then iterate through that list in a regular python loop to do the drawing. Does that seem like my best option short of just writing a C module to handle all the reading and drawing for me in one step?
You describe your data as being "lists of lists of lists of coordinates". From this I'm guessing your extraction looks like this:
for x in points:
    for y in x:
        for z in y:
            # z is a tuple with GPS coordinates
            pass
Do this:
# initially, points is a list of lists of lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing coordinates
points = itertools.chain.from_iterable(points)
# now points is an iterable producing individual floating points values
data = numpy.fromiter(points, float)
# data is a numpy array containing all the coordinates
data = data.reshape(data.size // 2, 2)
# data has now been reshaped to be an nx2 array
itertools and numpy.fromiter are both implemented in c and really efficient. As a result, this should do the transformation very quickly.
The second part of your question doesn't really indicate what you want to do with the data. Indexing a numpy array is slower than indexing python lists. You get speed by performing operations on the data in bulk. Without knowing more about what you are doing with that data, it's hard to suggest how to fix it.
UPDATE:
I've gone ahead and done everything using itertools and numpy. I am not responsible for any brain damage resulting from attempting to understand this code.
# firstly, we use imap to call GetMyPoints a bunch of times
objects = itertools.imap(GetMyPoints, xrange(100))
# next, we use itertools.chain to flatten it into all of the polygons
polygons = itertools.chain.from_iterable(objects)
# tee gives us two iterators over the polygons
polygons_a, polygons_b = itertools.tee(polygons)
# the lengths will be the length of each polygon
polygon_lengths = itertools.imap(len, polygons_a)
# for the actual points, we'll flatten the polygons into points
points = itertools.chain.from_iterable(polygons_b)
# then we'll flatten the points into values
values = itertools.chain.from_iterable(points)
# package all of that into a numpy array
all_points = numpy.fromiter(values, float)
# reshape the numpy array so we have two values for each coordinate
all_points = all_points.reshape(all_points.size // 2, 2)
# produce an iterator of lengths, but put a zero in front
polygon_positions = itertools.chain([0], polygon_lengths)
# produce another numpy array from this
# however, we take the cumulative sum
# so that each index will be the starting index of a polygon
polygon_positions = numpy.cumsum( numpy.fromiter(polygon_positions, int) )
# now for the transformation
# multiply the first coordinate of every point by *.5
all_points[:,0] *= .5
# now to get it out
# polygon_positions is all of the starting positions
# polygon_positions[1:] is the same, but shifted one forward,
# thus it gives us the end of each slice
# slice makes these all slice objects
slices = itertools.starmap(slice, itertools.izip(polygon_positions, polygon_positions[1:]))
# polygons produces an iterator which uses the slices to fetch
# each polygon
polygons = itertools.imap(all_points.__getitem__, slices)
# just iterate over the polygon normally
# each one will be a slice of the numpy array
for polygon in polygons:
    draw_polygon(polygon)
You might find it best to deal with a single polygon at a time. Convert each polygon into a numpy array and do the vector operations on that. You'll probably get a significant speed advantage just doing that. Putting all of your data into numpy might be a little difficult.
This is more difficult than most numpy stuff because of your oddly shaped data. Numpy pretty much assumes a world of uniformly shaped data.
The point of using numpy arrays is to avoid as much as possible for loops. Writing for loops yourself will result in slow code, but with numpy arrays you can use predefined vectorized functions which are much faster (and easier!).
So for the conversion of a list to an array you can use:
point_buffer = np.array(point_list)
If the list contains elements like (lat, lon), then this will be converted to an array with two columns.
With that numpy array you can easily manipulate all elements at once. For example, to multiply the first element of each coordinate pair by 0.5 as in your question, you can simply do (assuming that the first elements are, e.g., in the first column):
point_buffer[:, 0] *= 0.5
This will be faster:
numpy.array(point_buffer, dtype=numpy.float32)
Modify the array, not the list. It would obviously be better to avoid creating the list in the first place if possible.
Edit 1: profiling
Here is some test code that demonstrates just how efficiently numpy converts lists to arrays (it's good), and that my list-to-buffer idea is only comparable to what numpy does, not better.
import timeit
setup = '''
import numpy
import itertools
import struct
big_list = numpy.random.random((10000,2)).tolist()'''
old_way = '''
a = numpy.empty(( len(big_list), 2), numpy.float32)
for i,e in enumerate(big_list):
    a[i] = e
'''
normal_way = '''
a = numpy.array(big_list, dtype=numpy.float32)
'''
iter_way = '''
chain = itertools.chain.from_iterable(big_list)
a = numpy.fromiter(chain, dtype=numpy.float32)
'''
my_way = '''
chain = itertools.chain.from_iterable(big_list)
buffer = struct.pack('f'*len(big_list)*2,*chain)
a = numpy.frombuffer(buffer, numpy.float32)
'''
for way in [old_way, normal_way, iter_way, my_way]:
    print timeit.Timer(way, setup).timeit(1)
results:
0.22445492374
0.00450378469941
0.00523579114088
0.00451488946237
Edit 2: Regarding the hierarchical nature of the data
If I understand correctly that the data is always a list of lists of lists (object - polygon - coordinate), then this is the approach I'd take: reduce the data to the lowest dimension that forms a regular array (2D in this case) and track the indices of the higher-level branches with a separate array. This is essentially an implementation of Winston's idea of using numpy.fromiter with an itertools chain object. The only added idea is the branch indexing.
import numpy, itertools
# hierarchical list of lists of coord pairs
polys = [numpy.random.random((n,2)).tolist() for n in [5,7,12,6]]
# get the indices of the polygons:
lengs = numpy.array([0]+[len(l) for l in polys])
p_idxs = numpy.add.accumulate(lengs)
# convert the flattened list to an array:
chain = itertools.chain.from_iterable
a = numpy.fromiter(chain(chain(polys)), dtype=numpy.float32).reshape(lengs.sum(), 2)
# transform the coords
a *= .5
# get a transformed polygon (using the indices)
def get_poly(n):
    i0 = p_idxs[n]
    i1 = p_idxs[n+1]
    return a[i0:i1]
print 'poly2', get_poly(2)
print 'poly0', get_poly(0)
