Reshaping a dask.array in Fortran-contiguous order - python

I would like to ask if there is a way to reshape a dask array in Fortran-contiguous (column-major) order, since the parallelized version of np.reshape does not support it yet (see here).

Fortran-contiguous (column-major) order is simply C-contiguous (row-major) order in reverse. So there's a simple workaround for the fact that dask array doesn't support order='F':
1. Transpose your array to reverse its dimensions.
2. Reshape it to the reverse of your desired shape.
3. Transpose it back.
In a function:
def reshape_fortran(x, shape):
    return x.T.reshape(shape[::-1]).T
Transposing with NumPy/dask is basically free (it doesn't copy any data), so in principle this operation should also be quite efficient.
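As a quick check (not part of the original answer), you can confirm that a transpose is a view rather than a copy:
import numpy as np
x = np.arange(6).reshape(2, 3)
x.T.base is x             # True: the transpose is a view of x
np.shares_memory(x, x.T)  # True: no data was copied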
Here's a simple test to verify it does the right thing:
In [48]: import numpy as np
In [49]: import dask.array as da
In [50]: x = np.arange(100).reshape(10, 10)
In [51]: y = da.from_array(x, chunks=5)
In [52]: shape = (2, 5, 10)
In [53]: np.array_equal(reshape_fortran(y, shape).compute(),
...: x.reshape(shape, order='F'))
...:
Out[53]: True

Related

Extend 1d numpy array in multiple dimensions

I have a 1d numpy array, e.g. a = [10, 12, 15], and I want to extend it so that I end up with a numpy array b of shape (3, 10, 15, 20), filled with a so that e.g. b[:,1,1,1] is [10, 12, 15].
I thought of using np.repeat, but it's not clear to me how to do it.
tile will do it for you. Internally this does a repeat for each axis.
In [114]: a = np.array([10,12,15])
In [115]: A = np.tile(a.reshape(3,1,1,1),(1,10,15,20))
In [116]: A.shape
Out[116]: (3, 10, 15, 20)
In [117]: A[:,1,1,1]
Out[117]: array([10, 12, 15])
For some purposes it might be enough to just do the reshape and let broadcasting expand the dimensions as needed (without actually expanding memory use).
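As a sketch of that idea (not part of the original answer), np.broadcast_to produces a read-only view with the target shape without allocating the expanded data:
import numpy as np
a = np.array([10, 12, 15])
b = np.broadcast_to(a.reshape(3, 1, 1, 1), (3, 10, 15, 20))
b.shape         # (3, 10, 15, 20), but no new memory for the data
b[:, 1, 1, 1]   # array([10, 12, 15])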
Code:
import numpy as np
a = np.arange(1800).reshape((10,12,15))
b = np.repeat(a, repeats=5, axis=0).reshape((3,10,15,20))
You can change axis if you want to repeat in a different fashion. To understand repeat, try smaller shapes, e.g. a of shape (3,5,4) and b of shape (2,3,5,4), and repeat along different axes.
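For example, a minimal illustration on a small array:
import numpy as np
a = np.arange(6).reshape(2, 3)
np.repeat(a, 2, axis=0).shape  # (4, 3): each row appears twice in succession
np.repeat(a, 2, axis=1).shape  # (2, 6): each element doubled along the columns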

Is there an alternative, vectorized way to write the to_array function?

Suppose we have a ragged, nested sequence like the following:
import numpy as np
x = np.ones((10, 20))
y = np.zeros((10, 20))
a = [[0, x], [y, 1]]
and want to create a full numpy array that broadcasts the ragged sub-sequences (to match the maximum dimension of any other sub-sequence, in this case (10,20)) where necessary. First, we might try to use np.array(a), which yields the warning:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
By changing to np.array(a, dtype=object), we do get an array. However, this is an array of objects rather than floats, and retains the ragged subsequences, which have not been broadcasted as desired. To fix this, I created a new function to_array which takes a (possibly ragged, nested) sequence and a shape and returns a full numpy array of that shape:
def to_array(a, shape):
    a = np.array(a, dtype=object)
    b = np.empty(shape)
    for index in np.ndindex(a.shape):
        b[index] = a[index]
    return b
b = np.array(a, dtype=object)
c = to_array(a, (2, 2, 10, 20))
print(b.shape, b.dtype) # prints (2, 2) object
print(c.shape, c.dtype) # prints (2, 2, 10, 20) float64
Note that c, not b, is the desired result. However, to_array relies on a for loop over np.ndindex, and Python for loops are slow for big arrays.
Is there an alternative, vectorized way to write the to_array function?
Given the target shape, a few iterations don't seem overly expensive (here A is the (2, 2) object array, i.e. np.array(a, dtype=object)):
In [35]: C = np.empty((A.shape+x.shape), x.dtype)
In [36]: for idx in np.ndindex(A.shape):
...: C[idx] = A[idx]
...:
Alternatively you could replace the 0 and 1 with the appropriate (10,20) arrays. Here you've already created those, x and y:
In [37]: D = np.array([[y,x],[y,x]])
In [38]: np.allclose(C,D)
Out[38]: True
In general a few iterations on a complex task are ok. Keep in mind that (many) operations on an object dtype array are actually slower than operations on an equivalent list. It's the whole-array compiled operations on a numeric array that are relatively fast. That's not your case.
But
C[0,0,:,:] = 0
uses broadcasting: all (10,20) values of C[0,0] are filled with the scalar 0.
C[0,1,:,:] = x
is a different broadcasting, where the RHS matches the left. It's unreasonable to expect numpy to handle both cases with one broadcasting operation.
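A compact sketch of those two assignments side by side (using the x from the question):
import numpy as np
x = np.ones((10, 20))
C = np.empty((2, 2, 10, 20))
C[0, 0, :, :] = 0   # scalar broadcast: 0 is expanded to fill all of (10, 20)
C[0, 1, :, :] = x   # shape-matched assignment: the RHS already fits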

How do I apply a smoothing filter with dask

I have a 2-dimensional array and I would like to convolve it with a kernel, for example a simple flat square matrix.
See for example: http://nbviewer.jupyter.org/gist/zonca/f0d819048ef7318eff944396b71af1c4
Is there a way to run this multithreaded with dask?
The map_overlap method may do what you want. It allows you to map a function over chunks of your array where those chunks have been pre-buffered with an overlapping region from nearby chunks.
Something like the following might be a good start:
In [1]: import numpy as np
In [2]: x = np.random.normal(10, 1, size=(1000, 1000))
In [3]: from scipy.signal import convolve2d
In [4]: import dask.array as da
In [5]: d = da.from_array(x, chunks=(500, 500))
In [6]: filt = np.ones((8, 8))
In [7]: d.map_overlap(convolve2d, in2=filt, depth=8)
Out[7]: dask.array<trim-de..., shape=(1000, 1000), dtype=None, chunksize=(500, 500)>
Although beware that the filter you've provided both smooths and amplifies. You might also need to play with the trimming in convolve2d.
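For instance (a sketch building on the session above, not part of the original answer), you could normalize the kernel so it only smooths, and pass mode='same' through to convolve2d so each block keeps its shape:
In [8]: filt = np.ones((8, 8)) / 64.0   # normalized: smooths without amplifying
In [9]: result = d.map_overlap(convolve2d, in2=filt, mode='same', depth=8).compute()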

Append numpy array into an element

I have a Numpy array of shape (5,5,3,2). I want to take the element (1,4) of that array, which is itself an array of shape (3,2), and append a row to it, so that it becomes a (4,2) array.
The code I'm using is the following:
import numpy as np
a = np.random.rand(5,5,3,2)
a = np.array(a, dtype=object)  # so I can have different-size sub-matrices
a[2][3] = np.append(a[2][3], [[1.0,1.0]], axis=0)  # a[2][3] has shape (3,2)
I'm always obtaining the error:
ValueError: could not broadcast input array from shape (4,2) into shape (3,2)
I understand that the shape returned by the np.append function is not the same as that of the a[2][3] sub-array, but I thought that dtype=object would solve my problem. However, I need to do this. Is there any way to get around this limitation?
I also tried to use the insert function but I don't know how could I add the element in the place I want.
Make sure you understand what you have produced. That requires checking the shape and dtype, and possibly looking at the values.
In [29]: a = np.random.rand(5,5,3,2)
In [30]: b=np.array(a, dtype=object)
In [31]: a.shape
Out[31]: (5, 5, 3, 2) # a is a 4d array
In [32]: a.dtype
Out[32]: dtype('float64')
In [33]: b.shape
Out[33]: (5, 5, 3, 2) # so is b
In [34]: b.dtype
Out[34]: dtype('O')
In [35]: b[2,3].shape
Out[35]: (3, 2)
In [36]: c=np.append(b[2,3],[[1,1]],axis=0)
In [37]: c.shape
Out[37]: (4, 2)
In [38]: c.dtype
Out[38]: dtype('O')
b[2][3] is also an array. b[2,3] is the proper numpy way of indexing 2 dimensions.
I suspect you wanted b to be a (5,5) array containing arrays (as objects), and you think that you can simply replace one of those with a (4,2) array. But the b constructor simply changes the floats of a to objects, without changing the shape (or 4d nature) of b.
I could construct a (5,5) object array, and fill it with values from a. And then replace one of those values with a (4,2) array:
In [39]: B=np.empty((5,5),dtype=object)
In [40]: for i in range(5):
...: for j in range(5):
...: B[i,j]=a[i,j,:,:]
...:
In [41]: B.shape
Out[41]: (5, 5)
In [42]: B.dtype
Out[42]: dtype('O')
In [43]: B[2,3]
Out[43]:
array([[ 0.03827568, 0.63411023],
[ 0.28938383, 0.7951006 ],
[ 0.12217603, 0.304537 ]])
In [44]: B[2,3]=c
In [46]: B[2,3].shape
Out[46]: (4, 2)
This constructor for B is a bit crude. I've answered other questions about creating/filling object arrays, but I'm not going to take the time here to streamline this case. It's for illustration purposes only.
In an object array, any element can indeed be an array (or any other kind of object).
import numpy as np
a = np.random.rand(5,5,3,2)
a = np.array(a, dtype=object)
# Assign a 1D array to the array element a[2][3][0][0]:
a[2][3][0][0] = np.arange(10)
a[2][3][0][0][9] # 9
However a[2][3] is not an array element, it is a whole array.
a[2][3].ndim # 2
Therefore, when you do a[2][3] = (something), you are using broadcasting instead of assigning an element: numpy tries to replace the content of the subarray a[2][3] and fails because of the shape mismatch. The memory layout of numpy arrays does not allow changing the shape of subarrays.
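A small demonstration of that distinction (a sketch, not from the original answer):
import numpy as np
a = np.array(np.random.rand(5, 5, 3, 2), dtype=object)
a[2][3] = 0.0                # works: the scalar broadcasts into the (3,2) subarray
# a[2][3] = np.ones((4, 2))  # ValueError: could not broadcast (4,2) into (3,2)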
Edit: Instead of using numpy arrays you could use nested lists. These nested lists can have arbitrary sizes. Note that memory use and access time are higher than for a numpy array.
import numpy as np
a = np.random.rand(5,5,3,2)
a = np.array(a, dtype=object)
b = np.append(a[2][3], [[1.0,1.0]],axis=0)
a_list = a.tolist()
a_list[2][3] = b.tolist()
The problem here is that you are trying to assign to a[2][3].
Make a new array instead.
new_array = np.append(a[2][3],np.array([[1.0,1.0]]),axis=0)

Unlike Numpy, Pandas doesn't seem to like memory strides

Pandas seems to be missing an R-style matrix-level rolling window function (rollapply(..., by.column = FALSE)), providing only the vector-based version. I therefore tried to follow this question, and it works beautifully with the example, which can be replicated, but it doesn't work with pandas DataFrames, even when using the (seemingly identical) underlying Numpy array.
Artificial problem replication:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided
test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
mm = np.array(test, dtype = np.int64)
pp = pd.DataFrame(test).values
mm and pp look identical when printed.
The numpy directly-derived matrix gives me what I want perfectly:
as_strided(mm, (mm.shape[0] - 3 + 1, 3, mm.shape[1]), (mm.shape[1] * 8, mm.shape[1] * 8, 8))
That is, it gives me 3 windows of 3 rows each, in a 3d array, allowing me to perform computations on a submatrix while moving down one row at a time.
But the pandas-derived version (identical call with mm replaced by pp):
as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (pp.shape[1] * 8, pp.shape[1] * 8, 8))
comes out all wrong, as if it were transposed somehow. Is this to do with column-major/row-major ordering?
I need to do matrix sliding windows in Pandas, and this seems my best shot, especially because it is really fast. What's going on here? How do I get the underlying Pandas array to behave like Numpy?
It seems that .values returns the underlying data in Fortran order (as you speculated):
>>> mm.flags # NumPy array
C_CONTIGUOUS : True
F_CONTIGUOUS : False
...
>>> pp.flags # array from DataFrame
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
This confuses as_strided which expects the data to be arranged in C order in memory.
To fix things, you could copy the data in C order and use the same strides as in your question:
pp = pp.copy('C')
Alternatively, if you want to avoid copying large amounts of data, adjust the strides to acknowledge the column-order layout of the data:
as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (8, 8, pp.shape[0]*8))
Is this to do with column/row major order stuff?
Yes, see mm.strides and pp.strides.
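For the arrays built in the question (shape (5, 9), int64 items), the difference shows up directly:
>>> mm.strides  # C order: row stride = 9 columns * 8 bytes
(72, 8)
>>> pp.strides  # F order: column stride = 5 rows * 8 bytes
(8, 40)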
How do I get the underlying Pandas array to behave like Numpy?
The Numpy array mm is "C-contiguous" and that's why the stride trick works. If you want to call the exact same code on the array underlying the DataFrame, you can use np.ascontiguousarray first. Or maybe it would be better to write the data windowing while taking the array strides and itemsize into account.
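For example, a layout-agnostic variant (an adaptation, not the original answer's code) reads the strides off the array instead of hard-coding them:
from numpy.lib.stride_tricks import as_strided
s_row, s_col = pp.strides  # per-row and per-column byte steps, whatever the layout
windows = as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (s_row, s_row, s_col))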
