numpy arrays: filling and extracting data quickly - python

See important clarification at bottom of this question.
I am using numpy to speed up some processing of longitude/latitude coordinates. Unfortunately, my numpy "optimizations" made my code run about 5x more slowly than it ran without using numpy.
The bottleneck seems to be in filling the numpy array with my data, and then extracting out that data after I have done the mathematical transformations. To fill the array I basically have a loop like:
point_list = GetMyPoints() # returns a long list of ( lon, lat ) coordinate pairs
n = len( point_list )
point_buffer = numpy.empty( ( n, 2 ), numpy.float32 )
for point_index in xrange( 0, n ):
    point_buffer[ point_index ] = point_list[ point_index ]
That loop, just filling in the numpy array before even operating on it, is extremely slow, much slower than the entire computation was without numpy. (That is, it's not just the slowness of the python loop itself, but apparently some huge overhead in actually transferring each small block of data from python to numpy.) There is similar slowness on the other end; after I have processed the numpy arrays, I access each modified coordinate pair in a loop, again as
some_python_tuple = point_buffer[ index ]
Again that loop to pull the data out is much slower than the entire original computation without numpy. So, how do I actually fill the numpy array and extract data from the numpy array in a way that doesn't defeat the purpose of using numpy in the first place?
I am reading the data from a shape file using a C library that hands me the data as a regular python list. I understand that if the library handed me the coordinates already in a numpy array there would be no "filling" of the numpy array necessary. But unfortunately the starting point for me with the data is as a regular python list. And more to the point, in general I want to understand how you quickly fill a numpy array with data from within python.
Clarification
The loop shown above is actually oversimplified. I wrote it that way in this question because I wanted to focus on the problem I was seeing of trying to fill a numpy array slowly in a loop. I now understand that doing that is just slow.
In my actual application what I have is a shape file of coordinate points, and I have an API to retrieve the points for a given object. There are something like 200,000 objects. So I repeatedly call a function GetShapeCoords( i ) to get the coords for object i. This returns a list of lists, where each sublist is a list of lon/lat pairs, and the reason it's a list of lists is that some of the objects are multi-part (i.e., multi-polygon). Then, in my original code, as I read in each object's points, I was doing a transformation on each point by calling a regular python function, and then plotting the transformed points using PIL. The whole thing took about 20 seconds to draw all 200,000 polygons. Not terrible, but much room for improvement. I noticed that at least half of those 20 seconds were spent doing the transformation logic, so I thought I'd do that in numpy. And my original implementation was just to read in the objects one at a time, and keep appending all the points from the sublists into one big numpy array, which I then could do the math stuff on in numpy.
So, I now understand that simply passing a whole python list to numpy is the right way to set up a big array. But in my case I only read one object at a time. So one thing I could do is keep appending points together in a big python list of lists of lists. And then when I've compiled some large number of objects' points in this way (say, 10000 objects), I could simply assign that monster list to numpy.
So my question now is three parts:
(a) Is it true that numpy can take that big, irregularly shaped, list of lists of lists, and slurp it okay and quickly?
(b) I then want to be able to transform all the points in the leaves of that monster tree. What is the expression to get numpy to, for instance, "go into each sublist, and then into each subsublist, and then for each coordinate pair you find in those subsublists multiply the first (lon coordinate) by 0.5"? Can I do that?
(c) Finally, I need to get those transformed coordinates back out in order to plot them.
Winston's answer below seems to give some hint at how I might do this all using itertools. What I want to do is pretty much like what Winston does, flattening the list out. But I can't quite just flatten it out. When I go to draw the data, I need to be able to know when one polygon stops and the next starts. So, I think I could make it work if there were a way to quickly mark the end of each polygon (i.e., each subsublist) with a special coordinate pair like (-1000, -1000) or something like that. Then I could flatten with itertools as in Winston's answer, and then do the transforms in numpy. Then I need to actually draw from point to point using PIL, and here I think I'd need to reassign the modified numpy array back to a python list, and then iterate through that list in a regular python loop to do the drawing. Does that seem like my best option short of just writing a C module to handle all the reading and drawing for me in one step?
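To make that concrete, here is roughly what I have in mind (a rough sketch only: polygons stands for a flat iterable of polygons, each a list of (lon, lat) pairs, and the sentinel value is assumed never to occur in the real data):
import itertools
import numpy

SENTINEL = (-1000.0, -1000.0)

def with_sentinels(polys):
    # append a sentinel pair after each polygon so the breaks can be found later
    for poly in polys:
        for pair in poly:
            yield pair
        yield SENTINEL

values = itertools.chain.from_iterable(with_sentinels(polygons))
data = numpy.fromiter(values, numpy.float32).reshape(-1, 2)
data[:, 0] *= 0.5  # the transform (note the sentinels get transformed too)

# split the flat array back into polygons at the (transformed) sentinels
start = 0
for end in numpy.flatnonzero(data[:, 0] == SENTINEL[0] * 0.5):
    polygon = data[start:end].tolist()
    # draw polygon here with PIL, point to point
    start = end + 1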

You describe your data as being "lists of lists of lists of coordinates". From this I'm guessing your extraction looks like this:
for x in points:
    for y in x:
        for z in y:
            # z is a tuple with GPS coordinates
Do this:
# initially, points is a list of lists of lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing coordinates
points = itertools.chain.from_iterable(points)
# now points is an iterable producing individual floating point values
data = numpy.fromiter(points, float)
# data is a numpy array containing all the coordinates
data = data.reshape(data.size // 2, 2)
# data has now been reshaped to be an nx2 array
itertools and numpy.fromiter are both implemented in C and really efficient, so this conversion should be very fast.
The second part of your question doesn't really indicate what you want to do with the data. Indexing a numpy array is slower than indexing Python lists; you get speed by performing operations on the data in bulk. Without knowing more about what you are doing with that data, it's hard to suggest how to fix it.
UPDATE:
I've gone ahead and done everything using itertools and numpy. I am not responsible for any brain damage resulting from attempting to understand this code.
# firstly, we use imap to call GetMyPoints a bunch of times
objects = itertools.imap(GetMyPoints, xrange(100))
# next, we use itertools.chain to flatten it into all of the polygons
polygons = itertools.chain.from_iterable(objects)
# tee gives us two iterators over the polygons
polygons_a, polygons_b = itertools.tee(polygons)
# the lengths will be the length of each polygon
polygon_lengths = itertools.imap(len, polygons_a)
# for the actual points, we'll flatten the polygons into points
points = itertools.chain.from_iterable(polygons_b)
# then we'll flatten the points into values
values = itertools.chain.from_iterable(points)
# package all of that into a numpy array
all_points = numpy.fromiter(values, float)
# reshape the numpy array so we have two values for each coordinate
all_points = all_points.reshape(all_points.size // 2, 2)
# produce an iterator of lengths, but put a zero in front
polygon_positions = itertools.chain([0], polygon_lengths)
# produce another numpy array from this
# however, we take the cumulative sum
# so that each index will be the starting index of a polygon
polygon_positions = numpy.cumsum(numpy.fromiter(polygon_positions, int))
# now for the transformation
# multiply the first coordinate of every point by .5
all_points[:,0] *= .5
# now to get it out
# polygon_positions is all of the starting positions
# polygon_positions[1:] is the same, but shifted one forward,
# thus it gives us the end of each slice
# slice makes these all slice objects
slices = itertools.starmap(slice, itertools.izip(polygon_positions, polygon_positions[1:]))
# polygons produces an iterator which uses the slices to fetch
# each polygon
polygons = itertools.imap(all_points.__getitem__, slices)
# just iterate over the polygon normally
# each one will be a slice of the numpy array
for polygon in polygons:
    draw_polygon(polygon)
You might find it best to deal with a single polygon at a time. Convert each polygon into a numpy array and do the vector operations on that. You'll probably get a significant speed advantage just doing that. Putting all of your data into numpy might be a little difficult.
This is more difficult than most numpy stuff because of your oddly shaped data. Numpy pretty much assumes a world of uniformly shaped data.
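A minimal sketch of that per-polygon approach, reusing the names from the question (draw_polygon stands in for whatever PIL drawing you do):
import numpy

for i in xrange(200000):
    for poly in GetShapeCoords(i):  # each poly is a list of (lon, lat) pairs
        arr = numpy.array(poly, dtype=numpy.float32)
        arr[:, 0] *= 0.5  # the transform, vectorized over one polygon at a time
        draw_polygon(arr.tolist())  # hand the transformed points back for drawing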

The point of using numpy arrays is to avoid for loops as much as possible. Writing for loops yourself results in slow code; with numpy arrays you can instead use predefined vectorized functions, which are much faster (and easier!).
So for the conversion of a list to an array you can use:
point_buffer = np.array(point_list)
If the list contains elements like (lat, lon), then this will be converted to an array with two columns.
With that numpy array you can easily manipulate all elements at once. For example, to multiply the first element of each coordinate pair by 0.5 as in your question, you can simply do (assuming the first elements are, e.g., in the first column):
point_buffer[:, 0] *= 0.5
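Getting the transformed points back out is likewise best done in bulk; a single tolist() call is much faster than indexing the array row by row. A quick sketch (plot_point is just a hypothetical stand-in for your drawing code):
transformed = point_buffer.tolist()  # one bulk conversion back to a list of [lon, lat] lists
for lon, lat in transformed:
    plot_point(lon, lat)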

This will be faster:
point_buffer = numpy.array(point_list, dtype=numpy.float32)
Modify the array, not the list. It would obviously be better to avoid creating the list in the first place if possible.
Edit 1: profiling
Here is some test code that demonstrates just how efficiently numpy converts lists to arrays (it's good), and that my list-to-buffer idea is only comparable to what numpy does, not better.
import timeit
setup = '''
import numpy
import itertools
import struct
big_list = numpy.random.random((10000,2)).tolist()'''
old_way = '''
a = numpy.empty(( len(big_list), 2), numpy.float32)
for i,e in enumerate(big_list):
    a[i] = e
'''
normal_way = '''
a = numpy.array(big_list, dtype=numpy.float32)
'''
iter_way = '''
chain = itertools.chain.from_iterable(big_list)
a = numpy.fromiter(chain, dtype=numpy.float32)
'''
my_way = '''
chain = itertools.chain.from_iterable(big_list)
buffer = struct.pack('f'*len(big_list)*2,*chain)
a = numpy.frombuffer(buffer, numpy.float32)
'''
for way in [old_way, normal_way, iter_way, my_way]:
    print timeit.Timer(way, setup).timeit(1)
results:
0.22445492374
0.00450378469941
0.00523579114088
0.00451488946237
Edit 2: Regarding the hierarchical nature of the data
If I understand correctly that the data is always a list of lists of lists (object - polygon - coordinate), then this is the approach I'd take: reduce the data to the lowest dimension that creates a square array (2D in this case) and track the indices of the higher-level branches with a separate array. This is essentially an implementation of Winston's idea of using numpy.fromiter with an itertools chain object. The only added idea is the branch indexing.
import numpy, itertools
# hierarchical list of lists of coord pairs
polys = [numpy.random.random((n,2)).tolist() for n in [5,7,12,6]]
# get the indices of the polygons:
lengs = numpy.array([0]+[len(l) for l in polys])
p_idxs = numpy.add.accumulate(lengs)
# convert the flattened list to an array:
chain = itertools.chain.from_iterable
a = numpy.fromiter(chain(chain(polys)), dtype=numpy.float32).reshape(lengs.sum(), 2)
# transform the coords
a *= .5
# get a transformed polygon (using the indices)
def get_poly(n):
    i0 = p_idxs[n]
    i1 = p_idxs[n+1]
    return a[i0:i1]
print 'poly2', get_poly(2)
print 'poly0', get_poly(0)
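Drawing everything is then just a loop over the indexed slices (draw_polygon again standing in for the actual PIL calls):
for n in xrange(len(polys)):
    draw_polygon(get_poly(n).tolist())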

Related

Numpy notation to replace an enumerate(zip(....))

I'm starting to use numpy. I get the slice notations and element-wise computations, but I can't understand this:
for i, (I,J) in enumerate(zip(data_list[0], data_list[1])):
    joint_hist[int(np.floor(I/self.bin_size))][int(np.floor(J/self.bin_size))] += 1
Variables:
data_list contains two np.array().flatten() images (eventually more)
joint_hist[] is the joint histogram of those two images, it's displayed later with plt.imshow()
bin_size is the number of slots in the histogram
I can't understand why the coordinate in the final histogram is I,J. So it's not just that the value at a position in joint_hist[] is the result of some slicing/element-wise computation. I need to take the result of that computation and use THAT as the indices in joint_hist...
EDIT:
I indeed do not use the i in the loop; it's a leftover from previous iterations and I simply hadn't noticed I didn't need it anymore.
I do want to remain in control of the bin sizes and the details of how this is done, so I'm not particularly looking to use histogram2d. I will later be using this for further image processing, so I'd rather have the flexibility to adapt my approach than have to figure out if/how to do particular things with built-in functions.
You can indeed gussy up that for loop using some numpy notation. Assuming you don't actually need i (since it isn't used anywhere):
for I,J in (data_list.T // self.bin_size).astype(int):
    joint_hist[I, J] += 1
Explanation
data_list.T flips data_list on its side. Each row of data_list.T will contain the data for the pixels at a particular coordinate.
data_list.T // self.bin_size will produce the same result as np.floor(I/self.bin_size), only it will operate on all of the pixels at once, instead of one at a time.
.astype(int) does the same thing as int(...), but again operates on the entire array instead of a single element.
When you iterate over a 2D array with a for loop, the rows are returned one at a time. Thus, the for I,J in arr syntax will give you back one pair of pixels at a time, just like your zip statement did originally.
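If you want to drop the Python loop entirely, one option is to encode each (I, J) pair as a single flat index and let np.bincount do the counting. A sketch, assuming the bin indices are non-negative and that n_bins (the number of bins per axis) is known:
import numpy as np
idx = (data_list.T // self.bin_size).astype(int)
flat = idx[:, 0] * n_bins + idx[:, 1]  # one flat index per (I, J) pair; n_bins is assumed
joint_hist = np.bincount(flat, minlength=n_bins ** 2).reshape(n_bins, n_bins)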
Alternative
You could also just use histogramdd to calculate joint_hist, in place of your for loop. For your application it would look like:
import numpy as np
joint_hist,edges = np.histogramdd(data_list.T)
This would have different bins than the ones you specified above, though (numpy would determine them automatically).
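If you do want control over the binning, histogramdd also accepts explicit bin edges. A sketch, treating self.bin_size as a bin width (as the loop above does) and assuming the values run from 0 up to I_max and J_max:
import numpy as np
edges_I = np.arange(0, I_max + self.bin_size, self.bin_size)
edges_J = np.arange(0, J_max + self.bin_size, self.bin_size)
joint_hist, edges = np.histogramdd(data_list.T, bins=(edges_I, edges_J))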
If I understand, your goal is to make a histogram of correlated values in your images? Well, to achieve the right bin index, the computation that you used is not valid. Instead of np.floor(I/self.bin_size), use np.floor(I/(I_max/bin_size)).astype(int). You want to divide I and J by their respective resolutions. The result that you will get is a diagonal matrix for joint_hist if both data_list[0] and data_list[1] are the same flattened image.
So all put together:
I_max = data_list[0].max()+1
J_max = data_list[1].max()+1
joint_hist = np.zeros((I_max, J_max))
bin_size = 256
for i, (I, J) in enumerate(zip(data_list[0], data_list[1])):
    joint_hist[np.floor(I / (I_max / bin_size)).astype(int), np.floor(J / (J_max / bin_size)).astype(int)] += 1

Array operations using multiple indices of same array

I am very new to Python, and I am trying to get used to performing Python's array operations rather than looping through arrays. Below is an example of the kind of looping operation I am doing, but am unable to work out a suitable pure array operation that does not rely on loops:
import numpy as np

def f(arg1, arg2):
    # an arbitrary function
    pass

def myFunction(a1DNumpyArray):
    A = a1DNumpyArray
    # Create a square array with each dimension the size of the argument array.
    B = np.zeros((A.size, A.size))
    # Function f is a function of two elements of the 1D array. For each
    # element, i, I want to perform the function on it and every element
    # before it, and store the result in the square array, multiplied by
    # the difference between the ith and (i-1)th element.
    for i in range(A.size):
        B[i,:i] = f(A[i], A[:i])*(A[i]-A[i-1])
    # Sum through j and return full sums as 1D array.
    return np.sum(B, axis=0)
In short, I am integrating a function which takes two elements of the same array as arguments, returning an array of results of the integral.
Is there a more compact way to do this, without using loops?
The use of an arbitrary f function, and this [i, :i] business, complicates bypassing a loop.
Most of the fast compiled numpy operations work on the whole array, or on whole rows and/or columns, and effectively do so in parallel. Loops that are inherently sequential (where the value from one iteration depends on the previous) don't fit well. Different-sized lists or arrays in each iteration are also a good indicator that 'vectorizing' will be difficult.
for i in range(A.size):
    B[i,:i] = f(A[i], A[:i])*(A[i]-A[i-1])
With a sample A and known f (as simple as arg1*arg2), I'd generate a B array, and look for patterns that treat B as a whole. At first glance it looks like your B is a lower triangle. There are functions to help index those. But that final sum might change the picture.
Sometimes I tackle these problems with a bottom up approach, trying to remove inner loops first. But in this case, I think some sort of big-picture approach is needed.
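To illustrate the big-picture idea: if f happens to be vectorizable (take f = arg1*arg2 as a stand-in), the whole loop collapses into a masked outer product. A sketch under that assumption:
import numpy as np

A = np.random.random(5)
diffs = A - np.roll(A, 1)  # A[i] - A[i-1]; the i=0 wraparound lands in a row tril masks out
M = A[:, None] * A[None, :]  # f(A[i], A[j]) for every pair, here with f = arg1*arg2
B = np.tril(M * diffs[:, None], k=-1)  # keep only j < i, matching B[i, :i]
result = B.sum(axis=0)  # same as np.sum(B, axis=0) in the original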

Python/Numpy: Build 2D array without adding duplicate rows (for triangular mesh)

I'm working on some code that manipulates 3D triangular meshes. Once I have imported mesh data, I need to "unify" vertices that are at the same point in space.
I've been assuming that numpy arrays would be the fastest way of storing & manipulating the data, but I can't seem to find a fast way of building a list of vertices while avoiding adding duplicate entries.
So, to test out methods, I create a 30000x3 array with 10000 unique rows:
import numpy as np
points = np.random.random((10000,3))
raw_data = np.concatenate((points,points,points))
np.random.shuffle(raw_data)
This serves as a good approximation of mesh data, with each point appearing as a facet vertex 3 times. While unifying, I need to build a list of unique vertices; if a point already is in the list a reference to it must be stored.
The best I've been able to come up with using numpy so far has been the following:
def unify(raw_data):
    # first point must be new
    unified_verts = np.zeros((1,3),dtype=np.float64)
    unified_verts[0] = raw_data[0]
    ref_list = [0]
    for i in range(1,len(raw_data)):
        point = raw_data[i]
        index_array = np.where(np.all(point==unified_verts,axis=1))[0]
        # point not in array yet
        if len(index_array) == 0:
            point = np.expand_dims(point,0)
            unified_verts = np.concatenate((unified_verts,point))
            ref_list.append(len(unified_verts)-1)
        # point already exists
        else:
            ref_list.append(index_array[0])
    return unified_verts, ref_list
Testing using cProfile:
import cProfile
cProfile.run("unify(raw_data)")
On my machine this runs in 5.275 seconds. I've thought about using Cython to speed it up, but from what I've read Cython doesn't typically run much faster than numpy methods. Any advice on ways to do this more efficiently?
Jaime has shown a neat trick which can be used to view a 2D array as a 1D array with items that correspond to rows of the 2D array. This trick can allow you to apply numpy functions which take 1D arrays as input (such as np.unique) to higher dimensional arrays.
If the order of the rows in unified_verts does not matter (as long as the ref_list is correct with respect to unified_verts), then you could use np.unique along with Jaime's trick like this:
def unify2(raw_data):
    dtype = np.dtype((np.void, (raw_data.shape[1] * raw_data.dtype.itemsize)))
    uniq, inv = np.unique(raw_data.view(dtype), return_inverse=True)
    uniq = uniq.view(raw_data.dtype).reshape(-1, raw_data.shape[1])
    return uniq, inv
The result is the same in the sense that the raw_data can be reconstructed from the return values of unify (or unify2):
unified, ref = unify(raw_data)
uniq, inv = unify2(raw_data)
assert np.allclose(uniq[inv], unified[ref]) # raw_data
On my machine, unified, ref = unify(raw_data) requires about 51.390s, while uniq, inv = unify2(raw_data) requires about 0.133s (~ 386x speedup).
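One caveat with unify2: the void view requires raw_data to be C-contiguous (it is here, but a sliced or transposed array may not be), so it can be worth adding a guard at the top of the function:
raw_data = np.ascontiguousarray(raw_data)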

numpy: efficient execution of a complex reshape of an array

I am reading a vendor-provided large binary array into a 2D numpy array tempfid(M, N)
# load data
data=numpy.fromfile(file=dirname+'/fid', dtype=numpy.dtype('i4'))
# convert to complex data
fid=data[::2]+1j*data[1::2]
tempfid=fid.reshape(I*J*K, N)
and then I need to reshape it into a 4D array useful4d(N,I,J,K) using non-trivial mappings for the indices. I do this with a for loop along the following lines:
for idx in range(M):
    i=f1(idx) # f1, f2, and f3 are functions involving / and % as well as some lookups
    j=f2(idx)
    k=f3(idx)
    newfid[:,i,j,k] = tempfid[idx,:] #SLOW! CAN WE IMPROVE THIS?
Converting to complex takes 33% of the time, while copying these M slices takes the remaining 66%. Calculating the indices is fast irrespective of whether I do it one by one in a loop as shown or by numpy.vectorize-ing the operation and applying it to an arange(M).
Is there a way to speed this up? Any help on more efficient slicing, copying (or not) etc appreciated.
EDIT:
As learned in the answer to question "What's the fastest way to convert an interleaved NumPy integer array to complex64?" the conversion to complex can be sped up by a factor of 6 if a view is used instead:
fid = data.astype(numpy.float32).view(numpy.complex64)
idx = numpy.arange(M)
i = numpy.vectorize(f1)(idx)
j = numpy.vectorize(f2)(idx)
k = numpy.vectorize(f3)(idx)
# you can index arrays with other arrays
# that lets you specify this operation in one line.
newfid[:, i,j,k] = tempfid.T
I've never used numpy's vectorize. Vectorize just means that numpy will call your Python function multiple times. In order to get speed, you need to use array operations like the one I showed here and the one you used to get complex numbers.
EDIT
The problem is that the dimension of size 128 was first in newfid, but last in tempfid. This is easily fixed by using .T, which takes the transpose.
How about this: set up your indices using the vectorized versions of f1, f2, f3 (not necessarily using np.vectorize, but perhaps just writing a function that takes an array and returns an array), then use np.ix_:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ix_.html
to get the index arrays. Then reshape tempfid to the same shape as newfid and then use the results of np.ix_ to set the values. For example:
tempfid = np.arange(10)
i = f1(idx) # i = [4,3,2,1,0]
j = f2(idx) # j = [1,0]
ii = np.ix_(i,j)
newfid = tempfid.reshape((5,2))[ii]
This maps the elements of tempfid onto a new shape with a different ordering.
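The same index object also works on the left-hand side, which is what you need in order to set values in newfid. A small sketch reusing the arrays above:
newfid = np.zeros((5, 2))
newfid[ii] = tempfid.reshape((5, 2))  # scatter: newfid[i[a], j[b]] = source[a, b]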

Best way to create a NumPy array from a dictionary?

I'm just starting with NumPy so I may be missing some core concepts...
What's the best way to create a NumPy array from a dictionary whose values are lists?
Something like this:
d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }
Should turn into something like:
data = [
[10,20,30,?,?],
[50,60,?,?,?],
[100,200,300,400,500]
]
I'm going to do some basic statistics on each row, eg:
deviations = numpy.std(data, axis=1)
Questions:
What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large; a couple of million keys, each with ~20 items.
The number of values for each 'row' is different. If I understand correctly, numpy wants uniform sizes, so what do I fill in for the missing items to make std() happy?
Update: One thing I forgot to mention - while the python techniques are reasonable (eg. looping over a few million items is fast), it's constrained to a single CPU. Numpy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.
You don't need to create numpy arrays to call numpy.std().
You can call numpy.std() in a loop over all the values of your dictionary. The list will be converted to a numpy array on the fly to compute the standard deviation.
The downside of this method is that the main loop will be in Python and not in C. But I guess this should be fast enough: you will still compute std at C speed, and you will save a lot of memory as you won't have to store 0 values where you have variable-sized arrays.
If you want to further optimize this, you can store your values into a list of numpy arrays, so that you do the python list -> numpy array conversion only once.
if you find that this is still too slow, try to use Psyco to optimize the python loop.
if this is still too slow, try using Cython together with the numpy module. This tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with the sum function).
An alternative to Cython would be to use SWIG with numpy.i.
if you want to use only numpy and have everything computed at C level, try grouping all the records of same size together in different arrays and call numpy.std() on each of them. It should look like the following example.
example with O(N) complexity:
import numpy
list_size_1 = []
list_size_2 = []
for row in data.itervalues():
    if len(row) == 1:
        list_size_1.append(row)
    elif len(row) == 2:
        list_size_2.append(row)
list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)
std_1 = numpy.std(list_size_1, axis = 1)
std_2 = numpy.std(list_size_2, axis = 1)
While there are already some pretty reasonable ideas present here, I believe following is worth mentioning.
Filling missing data with any default value would spoil the statistical characteristics (std, etc). Evidently that's why Mapad proposed the nice trick with grouping same sized records.
The problem with it (assuming there isn't any a priori data on record lengths at hand) is that it involves even more computations than the straightforward solution:
at least O(N*logN) 'len' calls and comparisons for sorting with an effective algorithm
O(N) checks on the second pass through the list to obtain groups (their beginning and end indexes on the 'vertical' axis)
Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).
It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification: do not generate the whole list, but iterate through the dictionary, converting each row into a numpy.array and performing the required computations. Like this:
for row in data.itervalues():
    np_row = numpy.array(row)
    this_row_std = numpy.std(np_row)
    # compute any other statistic descriptors needed and then save to some list
In any case, a few million loops in Python won't take as long as one might expect. Besides, this doesn't look like a routine computation, so who cares if it takes an extra second/minute if it is run once in a while, or even just once.
A generalized variant of what was suggested by Mapad:
from numpy import array, mean, std

def get_statistical_descriptors(a):
    ax = len(a.shape) - 1
    functions = [mean, std]
    return [f(a, axis=ax) for f in functions]

def process_long_list_stats(data):
    import numpy
    groups = {}
    for key, row in data.iteritems():
        size = len(row)
        try:
            groups[size].append(key)
        except KeyError:
            groups[size] = [key]
    results = []
    for gr_keys in groups.itervalues():
        gr_rows = numpy.array([data[k] for k in gr_keys])
        stats = get_statistical_descriptors(gr_rows)
        results.extend(zip(gr_keys, zip(*stats)))
    return dict(results)
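A quick usage sketch with the dictionary from the question (each result pairs a key with its row's (mean, std)):
d = {1: [10, 20, 30], 2: [50, 60], 3: [100, 200, 300, 400, 500]}
stats = process_long_list_stats(d)
print stats[1]  # (mean, std) of the row stored under key 1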
numpy dictionary
You can use a structured array to preserve the ability to address a numpy object by a key, like a dictionary.
import numpy as np
dd = {'a':1,'b':2,'c':3}
dtype = [(key, float) for key in dd.keys()]
values = [tuple(dd.values())]
numpy_dict = np.array(values, dtype=dtype)
numpy_dict['c']
will now output
array([ 3.])
