Applying a function along an axis of a dask array - python

I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array below) typically have a shape of (6000, 31, 189, 192) and a size of ~25GB (hence my desire to use dask; I've been getting memory errors trying to process these arrays using numpy).
I need to fit a cubic polynomial along the time axis at each level / latitude / longitude point and store the resulting 4 coefficients. I've therefore set chunksize=(6000, 1, 1, 1) so I have a separate chunk for each grid point.
This is my function for getting the coefficients of the cubic polynomial (the time axis values, time_axis, are a global 1D numpy array defined elsewhere):
def my_polyfit(data):
    return numpy.polyfit(data.squeeze(), time_axis, 3)
(So in this case, numpy.polyfit returns an array of 4 coefficients.)
and this is the command I thought I'd need to apply it to each chunk:
dask_array.map_blocks(my_polyfit, chunks=(4, 1, 1, 1), drop_axis=0, new_axis=0).compute()
Whereby the time axis is now gone (hence drop_axis=0) and there's a new coefficient axis of length 4 in its place (hence new_axis=0).
When I run this command I get IndexError: tuple index out of range, so I'm wondering where/how I've misunderstood the use of map_blocks?

I suspect that your experience will be smoother if your function returns an array of the same dimension that it consumes. E.g. you might consider defining your function as follows:
def my_polyfit(data):
    return np.polyfit(data.squeeze(), ...)[:, None, None, None]
Then you can probably ignore the new_axis, drop_axis bits.
Performance-wise you might also want to consider using a larger chunksize. At 6000 numbers per chunk you have over a million chunks, which means you'll probably spend more time in scheduling than in actual computation. Generally I shoot for chunks that are a few megabytes in size. Of course, increasing chunksize would cause your mapped function to become more complex.
Example
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: def f(b):
   ...:     return np.polyfit(b.squeeze(), np.arange(5), 3)[:, None, None, None]
   ...:
In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))
In [5]: x.map_blocks(f, chunks=(4, 1, 1, 1)).compute()
Out[5]:
array([[[[ -1.29058580e+02,   2.21410738e+02,   1.00721521e+01],
         [ -2.22469851e+02,  -9.14889627e+01,  -2.86405832e+02],
         [  1.40415805e+02,   3.58726232e+02,   6.47166710e+02]],
...
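As a rough sketch of the chunk-size advice above (the chunk shape, the blockwise_polyfit helper and the stand-in time_axis are illustrative assumptions rather than your exact setup), the mapped function can fit every grid point in a larger block at once, since np.polyfit accepts a 2-D y and fits each column independently. Note that np.polyfit expects the independent variable (time) first and the data second:

import dask.array as da
import numpy as np

time_axis = np.arange(6000)  # stand-in for the global 1-D time coordinate

def blockwise_polyfit(block):
    # Collapse the spatial dimensions so np.polyfit fits all grid points in
    # the block in one call, then restore the spatial shape with a leading
    # coefficient axis of length 4.
    flat = block.reshape(block.shape[0], -1)
    coeffs = np.polyfit(time_axis, flat, 3)        # shape (4, n_points)
    return coeffs.reshape((4,) + block.shape[1:])

# ~7 MB chunks; an existing array could be rechunked the same way with
# dask_array.rechunk((6000, 1, 9, 16)).
x = da.random.random((6000, 31, 189, 192), chunks=(6000, 1, 9, 16))
fit = x.map_blocks(blockwise_polyfit, chunks=(4, 1, 9, 16), dtype=x.dtype)
result = fit.compute()   # shape (4, 31, 189, 192)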

Kind of late to the party, but figured this could use an alternative answer based on new features in Dask. In particular, we added apply_along_axis, which behaves basically like NumPy's apply_along_axis except it works on Dask Arrays instead. This results in somewhat simpler syntax. It also avoids the need to rechunk your data before applying your custom function to each 1-D piece, and it makes no real requirements of your initial chunking, which it tries to preserve in the end result (except for the axis that is either reduced or replaced).
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: def f(b):
   ...:     return np.polyfit(b, np.arange(len(b)), 3)
   ...:
In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))
In [5]: da.apply_along_axis(f, 0, x).compute()
Out[5]:
array([[[[  2.13570599e+02,   2.28924503e+00,   6.16369231e+01],
         [  4.32000311e+00,   7.01462518e+01,  -1.62215514e+02],
         [  2.89466687e+02,  -1.35522215e+02,   2.86643721e+02]],
...

Related

Reshape data after boolean indexes filtering

I have a data set called DATA which regroups several 3D tables from N=173 files, each of shape (4, 4, 64), so the numpy array called DATA has shape (173, 4, 4, 64). In each individual file I have a boolean column that specifies whether the data is good or bad. To filter my data I then use boolean conditions:
cond = DATA[:,3,:,:]==False
DATA_filtered = DATA[:,1,:,:][cond]
with the following shapes:
np.shape(DATA)
Out[854]: (173, 4, 4, 64)
np.shape(cond)
Out[855]: (173, 4, 64)
But with this technique I end up with a 1D array, and all the structure of the initial DATA set is lost. One option is numpy's reshape function, but that only works if the total size stays the same. When the boolean conditions produce tables of variable size, we can no longer predict the shape to reshape to. So is there a way to filter the data while keeping its global shape, with sizes that can vary depending on the flag in the data?
Here is a minimal example:
TEST = np.ones((173,4,4,64))
FLAG = np.random.choice(a=[False, True], size=(173,4,64))
cond = FLAG==False
data = TEST[:,0,:,:][cond]
Output:
np.shape(data)
Out[868]: (22167,)
Expected Output:
np.shape(data)
Out[868]: (173,4,)
where, for example, data[:,1,:] would be a subset with unequal array sizes between 0 and 64 across the 173 tables, depending on which data have been flagged.
Thank you in advance
Masked Array is your solution
In many circumstances, datasets can be incomplete or tainted by the presence of invalid data. For example, a sensor may have failed to record a value, or recorded an invalid one. The numpy.ma module provides a convenient way to address this issue by introducing masked arrays.
A masked array is the combination of a standard numpy.ndarray and a mask.
import numpy as np
import numpy.ma as ma
x = np.array([1, 2, 3, -1, 5])
mx = ma.masked_array(x, mask=[0, 0, 0, 1, 0])
mx.mean() # without taking the invalid data into account
Output
2.75
All of the above is taken from the Masked array documentation, so you might as well read the rest of it there.
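As a sketch of how this might apply to the minimal example in the question (reusing TEST and FLAG from above; masking where FLAG is True mirrors the cond used there), the (173, 4, 64) structure is preserved while reductions simply skip the flagged entries:

import numpy as np
import numpy.ma as ma

TEST = np.ones((173, 4, 4, 64))
FLAG = np.random.choice(a=[False, True], size=(173, 4, 64))

# Attach the mask instead of dropping entries with boolean indexing.
data = ma.masked_array(TEST[:, 0, :, :], mask=FLAG)

print(data.shape)            # (173, 4, 64) -- the global shape is preserved
print(data.count(axis=-1))   # number of valid values per row, between 0 and 64
print(data.mean())           # statistics ignore the masked entries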

Unlike Numpy, Pandas doesn't seem to like memory strides

Pandas seems to be missing an R-style matrix-level rolling window function (rollapply(..., by.column = FALSE)), providing only the vector-based version. I therefore tried to follow this question, and it works beautifully with the replicable example given there, but it doesn't work with pandas DataFrames, even when using the (seemingly identical) underlying Numpy array.
Artificial problem replication:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided
test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
mm = np.array(test, dtype = np.int64)
pp = pd.DataFrame(test).values
mm and pp look identical when printed.
The numpy directly-derived matrix gives me what I want perfectly:
as_strided(mm, (mm.shape[0] - 3 + 1, 3, mm.shape[1]), (mm.shape[1] * 8, mm.shape[1] * 8, 8))
That is, it gives me 3 strides of 3 rows each, in a 3d matrix, allowing me to perform computations on a submatrix moving down by one row at a time.
But the pandas-derived version (identical call with mm replaced by pp):
as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (pp.shape[1] * 8, pp.shape[1] * 8, 8))
is all weird like it's transposed somehow. Is this to do with column/row major order stuff?
I need to do matrix sliding windows in Pandas, and this seems my best shot, especially because it is really fast. What's going on here? How do I get the underlying Pandas array to behave like Numpy?
It seems that the .values returns the underlying data in Fortran order (as you speculated):
>>> mm.flags # NumPy array
C_CONTIGUOUS : True
F_CONTIGUOUS : False
...
>>> pp.flags # array from DataFrame
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
This confuses as_strided which expects the data to be arranged in C order in memory.
To fix things, you could copy the data in C order and use the same strides as in your question:
pp = pp.copy('C')
Alternatively, if you want to avoid copying large amounts of data, adjust the strides to acknowledge the column-order layout of the data:
as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (8, 8, pp.shape[0]*8))
Is this to do with column/row major order stuff?
Yes, see mm.strides and pp.strides.
How do I get the underlying Pandas array to behave like Numpy?
The Numpy array mm is "C-contiguous" and that's why the stride trick works. If you want to call the exact same code on the array underlying the DataFrame, you can use np.ascontiguousarray first. Or maybe it would be better to write the data windowing while taking the array strides and itemsize into account.
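A small sketch of the np.ascontiguousarray route mentioned above (reusing the question's test data; the stride of 8 assumes int64 values): it copies the data into row-major order, after which the original as_strided call works unchanged:

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided

test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
pp = pd.DataFrame(test).values        # Fortran-ordered, as discussed above

pp_c = np.ascontiguousarray(pp)       # copy into C (row-major) order
print(pp_c.flags['C_CONTIGUOUS'])     # True

windows = as_strided(pp_c, (pp_c.shape[0] - 3 + 1, 3, pp_c.shape[1]),
                     (pp_c.shape[1] * 8, pp_c.shape[1] * 8, 8))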

Numpy: transpose result of advanced indexing

>>> import numpy as np
>>> X = np.arange(27).reshape(3, 3, 3)
>>> x = [0, 1]
>>> X[x, x, :]
array([[ 0, 1, 2],
[12, 13, 14]])
I need to sum it along the 0 dimension but in the real world the matrix is huge and I would prefer to be summing it along -1 dimension which is faster due to memory layout. Hence I would like the result to be transposed:
array([[ 0, 12],
[ 1, 13],
[ 2, 14]])
How do I do that? I would like the result of numpy's "advanced indexing" to be implicitly transposed. Transposing it explicitly with .T at the end is even slower and is not an option.
Update1: in the real world advanced indexing is unavoidable and the subscripts are not guaranteed to be the same.
>>> x = [0, 0, 1]
>>> y = [0, 1, 1]
>>> X[x, y, :]
array([[ 0, 1, 2],
[ 3, 4, 5],
[12, 13, 14]])
Update2: To clarify that this is not an XY problem, here is the actual problem:
I have a large matrix X which contains elements x coming from some probability distribution. The probability distribution of the element depends on the neighbourhood of the element. This distribution is unknown so I follow the Gibbs sampling procedure to build a matrix which has elements from this distribution. In a nutshell it means that I make some initial guess for matrix X and then I keep iterating over the elements of matrix X updating each element x with a formula that depends on the neighbouring values of x. So, for any element of a matrix I need to get its neighbours (advanced indexing) and perform some operation on them (summation in my example). I have used line_profiler to see that the line which takes most of the time in my code is taking the sum of an array with respect to dimension 0 rather than -1. Hence I would like to know if there is a way to produce an already-transposed matrix as a result of advanced indexing.
I would like to sum it along the 0 dimension but in the real world the matrix is huge and I would prefer to be summing it along -1 dimension which is faster due to memory layout.
I'm not totally sure what you mean by this. If the underlying array is row-major (the default, i.e. X.flags.c_contiguous == True), then it may be slightly faster to sum it along the 0th dimension. Simply transposing an array using .T or np.transpose() does not, in itself, change how the array is laid out in memory.
For example:
# X is row-major
print(X.flags.c_contiguous)
# True
# Y is just a transposed view of X
Y = X.T
# the indices of the elements in Y are transposed, but their layout in memory
# is the same as in X, therefore Y is column-major rather than row-major
print(Y.flags.c_contiguous)
# False
You can convert from row-major to column-major, for example by using np.asfortranarray(X), but there is no way to perform this conversion without making a full copy of X in memory. Unless you're going to be performing lots of operations over the columns of X, it almost certainly won't be worthwhile doing the conversion.
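For illustration, a quick sketch (using the small X from the question) showing that np.asfortranarray makes a full column-major copy rather than a view:

import numpy as np

X = np.arange(27).reshape(3, 3, 3)    # row-major by default
Xf = np.asfortranarray(X)             # column-major copy

print(X.flags['C_CONTIGUOUS'], Xf.flags['F_CONTIGUOUS'])   # True True
print(np.shares_memory(X, Xf))        # False -- the data was copied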
If you want to store the result of your summation in a column-major array, you could use the out= kwarg to X.sum(), e.g.:
result = np.empty((3, 3), order='F') # Fortran-order, i.e. column-major
X.sum(0, out=result)
In your case the difference between summing over rows vs columns is likely to be very minimal, though - since you are already going to be indexing non-adjacent elements in X you will already be losing the benefit of spatial locality of reference that would normally make summing over rows slightly faster.
For example:
X = np.random.randn(100, 100, 100)
# summing over whole rows is slightly faster than summing over whole columns
%timeit X.sum(0)
# 1000 loops, best of 3: 438 µs per loop
%timeit X.T.sum(0)
# 1000 loops, best of 3: 486 µs per loop
# however, the locality advantage disappears when you are addressing
# non-adjacent elements using fancy indexing
%timeit X[[0, 0, 1], [0, 1, 1], :].sum()
# 100000 loops, best of 3: 4.72 µs per loop
%timeit X.T[[0, 0, 1], [0, 1, 1], :].sum()
# 100000 loops, best of 3: 4.63 µs per loop
Update
@senderle has mentioned in the comments that using numpy v1.6.2 he sees the opposite ordering for the timings, i.e. X.sum(-1) is faster than X.sum(0) for a row-major array. This seems to be related to the version of numpy: using v1.6.2 I can reproduce the order that he observes, but using two newer versions (v1.8.2 and 1.10.0.dev-8bcb756) I observe the opposite (i.e. X.sum(0) is faster than X.sum(-1) by a small margin). Either way, I don't think changing the memory order of the array is likely to help much for the OP's case.

Calculating long expressions using Numpy (coordinate transform)?

In Python's Numpy module, is there a function that can calculate long/advanced math expressions on an array? I have heard of the numexpr module but want to steer clear of further dependencies.
Better yet, can I limit these expressions to, say, only the first or second element of the sub-arrays within my array, without having to unpack them as separate arrays?
Here is my specific problem. I have an array of arrays containing geographic point coordinates that looks like this: [[x1,y1],[x2,y2],[x3,y3],etc...]. What I want is to transform these geocoords to pixel coordinates so they can be drawn on an image. I therefore want to run the following expression/calculation on the first element of each subarray, i.e. the xs:
((180+X)/360)*screenwidthpixels
And on the second element, i.e. the ys:
((-90+Y)/180)*-screenheightpixels
These expressions would work in a Python for-loop, but that is too slow, which is why I'm turning to Numpy. I have tried chaining numpy's individual math operator functions one after another, but that was still too slow, and besides, to do it I first had to unpack all the xs and ys into separate arrays and repack them after the calculation, making it even slower.
So I guess I'm looking for a more direct Numpy way, using fewer steps, to transform my coordinate array with the expressions above. Any ideas?
import numpy as np
points = np.random.rand(10,2)
translation = np.array([180,-90])
scaling = np.array([1024, -768]) / np.array([360,180])
transformed_points = (points + translation) * scaling
This will do what you are looking for. It relies on numpy broadcasting rules to achieve expressiveness and performance.
But rather than explaining exactly how that works, I think you are better off finding yourself a good numpy primer and starting at the top. numpy is one of the best things about python, and you can't go wrong learning a little more about it. Suffice to say, numpy is certainly up to the kind of task you are facing.
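For example, a quick check (with the illustrative 1024x768 screen dimensions assumed above) that the broadcast expression matches the question's per-axis formulas:

import numpy as np

screenwidthpixels, screenheightpixels = 1024, 768
points = np.random.rand(10, 2) * [360, 180] - [180, 90]   # fake lon/lat pairs

translation = np.array([180, -90])
scaling = np.array([screenwidthpixels, -screenheightpixels]) / np.array([360, 180])
transformed = (points + translation) * scaling

# Same result as applying the question's formulas to each column separately.
x_pix = ((180 + points[:, 0]) / 360) * screenwidthpixels
y_pix = ((-90 + points[:, 1]) / 180) * -screenheightpixels
print(np.allclose(transformed, np.column_stack([x_pix, y_pix])))   # True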
I'm a little confused because I'm not sure exactly what you're saying you already tried, or what the speed condition for success is.
Are you saying you already tried something like the following, but it is too slow?
arr = whatever
arr[:,0] = (arr[:,0] + 180) / (360 * screenwidthpixels)
arr[:,1] = 180 - (arr[:,1] - 90) / (180 * screenheightpixels)
I'm not sure what you mean by "having to unpack" to X and Y. Here's how you avoid unpacking (if I understand correctly):
arr = np.array([ [x1,y1], [x2,y2], [x3,y3] ])
arr.shape
=> (3, 2)
X = arr[:,0] # fast, creates a view
Y = arr[:,1] # fast too
((X+180)/360)/screenwidthpixels
Further speed up can be achieved by rewriting/simplifying your expressions.
((X+180)/360)/s => (X+180)/(360*s)
(180-((Y+90)/180))/s => (180/s-1/(2*s)) - y/(180*s)
In the first rewrite, you get 2 traverses of the array, instead of 3, and in the second, the array is only traversed twice, instead of 4 times.
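To make the second rewrite concrete, here is a small check with illustrative values that the rearranged expression gives identical results while traversing the array only twice:

import numpy as np

s = 1024.0                             # illustrative screen dimension
Y = np.random.rand(1000) * 180 - 90    # fake y coordinates

lhs = (180 - ((Y + 90) / 180)) / s                # original form: 4 array passes
rhs = (180 / s - 1 / (2 * s)) - Y / (180 * s)     # rewritten form: 2 array passes
print(np.allclose(lhs, rhs))           # True -- same values, fewer temporaries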
In [234]: from numpy import arange, array
In [235]: xs=arange(1000)
In [236]: ys=arange(1, 1001)
In [237]: a=array([xs, ys]).T
In [238]: a
Out[238]:
array([[ 0, 1],
[ 1, 2],
[ 2, 3],
...,
[ 997, 998],
[ 998, 999],
[ 999, 1000]])
In [240]: a[:, 0]=(a[:, 0]+180)/360/1024
a[:, 0] gives a view of the first column of a; it's fast and memory-saving. See the numpy indexing docs for details.

How can I use numpy to calculate a series effectively?

I want to create an array in numpy that contains the values of a mathematical series, in this example one where each value is the square of the previous value, given a single starting value, i.e. a_0 = 2, a_1 = 4, a_2 = 16, ...
Trying to use the vectorization in numpy I thought this might work:
import numpy as np
a = np.array([2,0,0,0,0])
a[1:] = a[0:-1]**2
but the outcome is
array([2, 4, 0, 0, 0])
I have now learned that numpy evaluates the right-hand side into a temporary array first and only afterwards copies it into the slice, which is why it fails for the values that were zero in the original array.
Is there a way to vectorize this calculation using numpy, numexpr or other tools? What other ways are there to calculate the values of such a series effectively, given the fast numpy functions available, without resorting to a for loop?
There is no general way to vectorise recursive sequence definitions in NumPy. This particular case is rather easy to write without a for-loop though:
>>> import numpy
>>> 2 ** 2 ** numpy.arange(5)
array([ 2, 4, 16, 256, 65536])
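If the recurrence can't be collapsed into a closed form like this, the usual fallback is a plain loop over a preallocated array; a minimal sketch, using the squaring rule from the question:

import numpy as np

def recursive_series(a0, n, step=lambda x: x**2):
    # Fill an array where each element is step() applied to the previous one.
    out = np.empty(n, dtype=np.int64)
    out[0] = a0
    for i in range(1, n):
        out[i] = step(out[i - 1])
    return out

print(recursive_series(2, 5))   # [2 4 16 256 65536]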
