Pandas seems to be missing an R-style matrix-level rolling window function (rollapply(..., by.column = FALSE)), providing only the vector-based version. So I tried to follow this question, and it works beautifully with the example, which can be replicated, but it doesn't work with pandas DataFrames even when using the (seemingly identical) underlying NumPy array.
Artificial problem replication:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided
test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
mm = np.array(test, dtype = np.int64)
pp = pd.DataFrame(test).values
mm and pp look identical:
The numpy directly-derived matrix gives me what I want perfectly:
as_strided(mm, (mm.shape[0] - 3 + 1, 3, mm.shape[1]), (mm.shape[1] * 8, mm.shape[1] * 8, 8))
That is, it gives me 3 strides of 3 rows each, in a 3d matrix, allowing me to perform computations on a submatrix moving down by one row at a time.
But the pandas-derived version (identical call with mm replaced by pp):
as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (pp.shape[1] * 8, pp.shape[1] * 8, 8))
is all weird like it's transposed somehow. Is this to do with column/row major order stuff?
I need to do matrix sliding windows in Pandas, and this seems my best shot, especially because it is really fast. What's going on here? How do I get the underlying Pandas array to behave like Numpy?
It seems that .values returns the underlying data in Fortran order (as you speculated):
>>> mm.flags # NumPy array
C_CONTIGUOUS : True
F_CONTIGUOUS : False
...
>>> pp.flags # array from DataFrame
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
This confuses as_strided which expects the data to be arranged in C order in memory.
To fix things, you could copy the data in C order and use the same strides as in your question:
pp = pp.copy('C')
Alternatively, if you want to avoid copying large amounts of data, adjust the strides to acknowledge the column-order layout of the data:
as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (8, 8, pp.shape[0]*8))
Is this to do with column/row major order stuff?
Yes, see mm.strides and pp.strides.
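A quick way to see the difference is to print the strides. This is only a sketch: np.asfortranarray stands in for the DataFrame-derived array (an assumption made so the snippet does not need pandas; the layout pandas actually hands back may depend on the version and dtypes involved).

```python
import numpy as np

test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
mm = np.array(test, dtype=np.int64)   # C order (row-major)
pp = np.asfortranarray(mm)            # assumed stand-in for the Fortran-ordered .values

# Strides are the byte steps needed to move one element along each axis.
print(mm.strides)  # (72, 8): next row is 9 * 8 bytes away, next column 8 bytes
print(pp.strides)  # (8, 40): next row is 8 bytes away, next column 5 * 8 bytes
```

Both arrays compare equal element by element; only the memory layout differs, which is why hard-coded strides work for one and not the other.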
How do I get the underlying Pandas array to behave like Numpy?
The Numpy array mm is "C-contiguous" and that's why the stride trick works. If you want to call the exact same code on the array underlying the DataFrame, you can use np.ascontiguousarray first. Or maybe it would be better to write the data windowing while taking the array strides and itemsize into account.
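A minimal sketch of that last suggestion, reading the strides from the array instead of hard-coding them. The helper name rolling_rows is made up for illustration, and np.asfortranarray is again an assumed stand-in for the DataFrame values so the snippet runs without pandas.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def rolling_rows(arr, window):
    """Return a read-only (n_windows, window, n_cols) view of a 2-D array.

    Hypothetical helper: it uses arr.strides, so it works for both
    C-ordered and Fortran-ordered input without copying.
    """
    n_rows, n_cols = arr.shape
    row_stride, col_stride = arr.strides
    shape = (n_rows - window + 1, window, n_cols)
    # Moving to the next window and moving to the next row inside a
    # window are both a step of one row in the original array.
    strides = (row_stride, row_stride, col_stride)
    return as_strided(arr, shape=shape, strides=strides, writeable=False)


test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
mm = np.array(test, dtype=np.int64)
pp = np.asfortranarray(mm)  # assumed stand-in for pd.DataFrame(test).values

windows_c = rolling_rows(mm, 3)
windows_f = rolling_rows(pp, 3)
print(windows_c.shape)                        # (3, 3, 9)
print(np.array_equal(windows_c, windows_f))   # True
```

Passing writeable=False is a precaution: the windows overlap in memory, so writing through the view would change several windows at once.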
I'm using numpy and want to index a row without losing the dimension information.
import numpy as np
X = np.zeros((100,10))
X.shape # >> (100, 10)
xslice = X[10,:]
xslice.shape # >> (10,)
In this example xslice is now 1 dimension, but I want it to be (1,10).
In R, I would use X[10,:,drop=F]. Is there something similar in numpy? I couldn't find it in the documentation and didn't see a similar question asked.
Thanks!
Another solution is to do
X[[10],:]
or
I = np.array([10])
X[I,:]
The dimensionality of an array is preserved when indexing is performed by a list (or an array) of indexes. This is nice because it leaves you with the choice between keeping the dimension and squeezing.
It's probably easiest to do x[None, 10, :] or equivalently (but more readable) x[np.newaxis, 10, :]. None or np.newaxis increases the dimension of the array by 1, so that you're back to the original after the slicing eliminates a dimension.
As far as why it's not the default, personally, I find that constantly having arrays with singleton dimensions gets annoying very quickly. I'd guess the numpy devs felt the same way.
Also, numpy handles broadcasting arrays very well, so there's usually little reason to retain the dimension of the array the slice came from. If you did, then things like:
a = np.zeros((100,100,10))
b = np.zeros((100,10))
a[0,:,:] = b
either wouldn't work or would be much more difficult to implement.
(Or at least that's my guess at the numpy devs' reasoning behind dropping dimension info when slicing)
I found a few reasonable solutions.
1) use numpy.take(X,[10],0)
2) use this strange indexing X[10:11:, :]
Ideally, this should be the default. I never understood why dimensions are ever dropped. But that's a discussion for numpy...
Here's an alternative I like better. Instead of indexing with a single number, index with a range. That is, use X[10:11,:]. (Note that 10:11 does not include 11).
import numpy as np
X = np.zeros((100,10))
X.shape # >> (100, 10)
xslice = X[10:11,:]
xslice.shape # >> (1,10)
This is also easy to follow with more dimensions: there is no None juggling and no need to work out which index belongs to which axis. There is also no extra bookkeeping about array size; just use i:i+1 for any i that you would have used in regular indexing.
b = np.ones((2, 3, 4))
b.shape # >> (2, 3, 4)
b[1:2,:,:].shape # >> (1, 3, 4)
b[:, 2:3, :].shape # >> (2, 1, 4)
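One practical difference between the slice form and the list form, sketched below: a slice is basic indexing and returns a view of X, while indexing with a list is advanced indexing and returns a copy. The variable names are just for illustration.

```python
import numpy as np

X = np.zeros((100, 10))

by_slice = X[10:11, :]   # basic indexing: a view onto X
by_list = X[[10], :]     # advanced indexing: an independent copy

print(by_slice.shape, by_list.shape)    # (1, 10) (1, 10)
print(np.shares_memory(X, by_slice))    # True
print(np.shares_memory(X, by_list))     # False

# Writing through the view changes X; writing to the copy does not.
by_slice[0, 0] = 1.0
by_list[0, 1] = 5.0
print(X[10, 0], X[10, 1])               # 1.0 0.0
```

So the slice form is the one to pick if you want to modify the original row in place, and the list form if you want a detached row.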
To add to the solution involving indexing by lists or arrays by gnebehay, it is also possible to use tuples:
X[(10,),:]
This is especially annoying if you're indexing by an array that might be length 1 at runtime. For that case, there's np.ix_:
some_array[np.ix_(row_index,column_index)]
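A small sketch of what np.ix_ does when one of the index arrays happens to have length 1. The array contents and index values here are made up for illustration.

```python
import numpy as np

some_array = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]
row_index = np.array([1])                  # length 1 at runtime
column_index = np.array([0, 2])

# np.ix_ builds an open mesh, so the result always has one axis per
# index array, even when an index array has a single element.
block = some_array[np.ix_(row_index, column_index)]
print(block.shape)  # (1, 2)
print(block)        # [[4 6]]
```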
I've been using np.reshape to achieve the same as shown below
import numpy as np
X = np.zeros((100,10))
X.shape # >> (100, 10)
xslice = X[10,:].reshape(1, -1)
xslice.shape # >> (1, 10)
I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array below) typically have a shape of (6000, 31, 189, 192) and a size of ~25GB (hence my desire to use dask; I've been getting memory errors trying to process these arrays using numpy).
I need to fit a cubic polynomial along the time axis at each level / latitude / longitude point and store the resulting 4 coefficients. I've therefore set chunksize=(6000, 1, 1, 1) so I have a separate chunk for each grid point.
This is my function for getting the coefficients of the cubic polynomial (the time_axis axis values are a global 1D numpy array defined elsewhere):
def my_polyfit(data):
    return numpy.polyfit(data.squeeze(), time_axis, 3)
(So in this case, numpy.polyfit returns a list of length 4)
and this is the command I thought I'd need to apply it to each chunk:
dask_array.map_blocks(my_polyfit, chunks=(4, 1, 1, 1), drop_axis=0, new_axis=0).compute()
Whereby the time axis is now gone (hence drop_axis=0) and there's a new coefficient axis in its place (of length 4).
When I run this command I get IndexError: tuple index out of range, so I'm wondering where/how I've misunderstood the use of map_blocks?
I suspect that your experience will be smoother if your function returns an array of the same dimension that it consumes. E.g. you might consider defining your function as follows:
def my_polyfit(data):
    return np.polyfit(data.squeeze(), ...)[:, None, None, None]
Then you can probably ignore the new_axis, drop_axis bits.
Performance-wise you might also want to consider using a larger chunksize. At 6000 numbers per chunk you have over a million chunks, which means you'll probably spend more time in scheduling than in actual computation. Generally I shoot for chunks that are a few megabytes in size. Of course, increasing chunksize would cause your mapped function to become more complex.
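To put rough numbers on that, here is a back-of-the-envelope sketch using the shape and chunking from the question. The 4-byte item size is an assumption (float32 would be consistent with the ~25GB quoted, but the question does not state the dtype).

```python
import math

shape = (6000, 31, 189, 192)   # (time, depth, latitude, longitude) from the question
chunks = (6000, 1, 1, 1)       # one chunk per grid point
itemsize = 4                   # assumption: float32; not stated in the question

# Number of chunks is the product of chunks-per-axis.
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
chunk_bytes = math.prod(chunks) * itemsize

print(n_chunks)      # 1124928
print(chunk_bytes)   # 24000
```

Under that assumption each chunk is only about 24 kB, far below the few-megabyte target mentioned above.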
Example
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: def f(b):
   ...:     return np.polyfit(b.squeeze(), np.arange(5), 3)[:, None, None, None]
   ...:
In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))
In [5]: x.map_blocks(f, chunks=(4, 1, 1, 1)).compute()
Out[5]:
array([[[[ -1.29058580e+02, 2.21410738e+02, 1.00721521e+01],
[ -2.22469851e+02, -9.14889627e+01, -2.86405832e+02],
[ 1.40415805e+02, 3.58726232e+02, 6.47166710e+02]],
...
Kind of late to the party, but figured this could use an alternative answer based on new features in Dask. In particular, we added apply_along_axis, which behaves basically like NumPy's apply_along_axis except for Dask Arrays instead. This results in somewhat simpler syntax. Also it avoids the need to rechunk your data before applying your custom function to each 1-D piece and makes no real requirements of your initial chunking, which it tries to preserve in the end result (excepting the axis that is either reduced or replaced).
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: def f(b):
   ...:     return np.polyfit(b, np.arange(len(b)), 3)
   ...:
In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))
In [5]: da.apply_along_axis(f, 0, x).compute()
Out[5]:
array([[[[ 2.13570599e+02, 2.28924503e+00, 6.16369231e+01],
[ 4.32000311e+00, 7.01462518e+01, -1.62215514e+02],
[ 2.89466687e+02, -1.35522215e+02, 2.86643721e+02]],
...
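A different angle, not taken from either answer: np.polyfit accepts a 2-D y whose columns are separate data sets sharing the same x, so within one in-memory block the per-point loop can be replaced by a reshape. This is a plain NumPy sketch on a small random array. Note that it passes the time values as x and the data as y, which is my assumption about the intended fit and is the opposite argument order from the snippets above.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((5, 3, 3, 3))            # small stand-in for (time, depth, lat, lon)
time_axis = np.arange(data.shape[0])

# Flatten every non-time axis into columns: shape (time, n_points).
flat = data.reshape(data.shape[0], -1)

# One call fits a cubic to every column; the result has shape (4, n_points).
coeffs = np.polyfit(time_axis, flat, 3)

# Restore the spatial axes behind the new coefficient axis.
coeffs = coeffs.reshape((4,) + data.shape[1:])
print(coeffs.shape)  # (4, 3, 3, 3)
```

A function like this could be what gets mapped over larger chunks, which would address the chunk-count concern raised earlier, though I have not tried it on data of the size in the question.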
I want to create an array in numpy that contains the values of a mathematical series, in this example the square of the previous value, given a single starting value, i.e. a_0 = 2, a_1 = 4, a_2 = 16, ...
Trying to use the vectorization in numpy I thought this might work:
import numpy as np
a = np.array([2,0,0,0,0])
a[1:] = a[0:-1]**2
but the outcome is
array([2, 4, 0, 0, 0])
I have learned now that numpy does internally create a temporary array for the output and in the end copies this array, that is why it fails for the values that are zero in the original array.
Is there a way to vectorize this function using numpy, numexpr or other tools? What other ways are there to effectively calculate the values of a series when fast numpy functions are available without going for a for loop?
There is no general way to vectorise recursive sequence definitions in NumPy. This particular case is rather easy to write without a for-loop though:
>>> 2 ** 2 ** numpy.arange(5)
array([ 2, 4, 16, 256, 65536])
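For recurrences without a closed form like this one, a generic fallback is to generate the values sequentially and build the array at the end. The sketch below uses itertools.accumulate from the standard library (the initial argument needs Python 3.8+). It is still a loop under the hood, not true vectorisation.

```python
from itertools import accumulate

import numpy as np

n_terms = 5

# accumulate feeds each result back in as the next input; the values of
# range(...) are ignored and only control how many steps are taken.
series = accumulate(range(n_terms - 1), lambda prev, _: prev ** 2, initial=2)
a = np.fromiter(series, dtype=np.int64, count=n_terms)

print(a)  # [    2     4    16   256 65536]
```

Keep in mind that int64 overflows quickly with this series: the term after 65536**2 is 2**64, which no longer fits.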
I'm looking for dynamically growing vectors in Python, since I don't know their length in advance. In addition, I would like to calculate distances between these sparse vectors, preferably using the distance functions in scipy.spatial.distance (although any other suggestions are welcome). Any ideas how to do this? (Initially, it doesn't need to be efficient.)
Thanks a lot in advance!
You can use regular python lists (which are dynamic) as vectors. Trivial example follows.
from scipy.spatial.distance import sqeuclidean
a = [1,2,3]
b = [0,0,0]
print(sqeuclidean(a,b)) # 14
As per aganders3's suggestion, do note that you can also use numpy arrays if needed:
import numpy
a = numpy.array([1,2,3])
If the sparse part of your question is crucial I'd use scipy for that, since it has support for sparse matrices. You can define a 1xn matrix and use it as a vector. This works (the parameter is the size of the matrix, filled with zeroes by default):
sqeuclidean(scipy.sparse.coo_matrix((1,3)),scipy.sparse.coo_matrix((1,3))) # 0
There are many kinds of sparse matrices, some dictionary-based (see comment). You can define a row sparse matrix from a list like this:
scipy.sparse.csr_matrix([1,2,3])
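If efficiency is not a concern yet, a plain dict mapping index to value is one way to get a vector that is both sparse and grows on demand. This is only a sketch; the function name sq_euclidean_sparse is made up here and is not part of scipy.

```python
def sq_euclidean_sparse(u, v):
    """Squared Euclidean distance between two dict-based sparse vectors.

    Hypothetical helper: missing indices are treated as zero.
    """
    keys = u.keys() | v.keys()
    return sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys)


a = {0: 1.0, 1: 2.0, 2: 3.0}
b = {}

print(sq_euclidean_sparse(a, b))  # 14.0

# The vector grows simply by assigning to a new index.
a[10] = 2.0
print(sq_euclidean_sparse(a, b))  # 18.0
```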
Here is how you can do it in numpy:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([0, 0, 0])
c = np.sum(((a - b) ** 2)) # 14