How to prepend a value before xarray.DataArray.diff? - python

Given an xarray.DataArray with area values ('Af') along date and depth dimensions, xarray.DataArray 'Af' (time: 366, z: 20), how do I get the difference between consecutive areas along the depth dimension for each date, so that the result has the same length as 'Af' along depth and the first difference is equal to the first indexed area?
So something like:
area_1 = Af_1
area_2 = Af_2 - Af_1
...
area_i = Af_i - Af_(i-1)
If it were in numpy, I could have used np.diff(array, prepend=0), but xarray's diff method has no prepend option. Is there any way to imitate np.diff(array, prepend=0) in xarray?
I am new to xarray (and coding in general), so any help would be appreciated.

If I understand you correctly, you want to do the following:
Compute the differences of the areas along the z-dimension. This will necessarily result in a coordinate that is shorter than before.
Then you want to prepend the areas at the first depth to get an array of the same length.
You can use xarray's diff method for the first step and concat for the second step.
import numpy as np
import xarray as xr
# Create a dummy dataset
da = xr.DataArray(
    data=np.random.rand(3, 4),
    dims=("time", "z"),
    name="area",
    coords={"time": np.arange(3), "z": np.arange(4)},
)
# Compute the differences
differences = da.diff(dim="z", label="upper")
# Concatenate the differences with the areas at the first depth
xr.concat([da.isel(z=0), differences], dim="z")
I think that this is the easiest way, because you can take advantage of xarray's labeled dimensions.
However, you can actually use any numpy function on an xarray DataArray, so you could use np.diff as well. You will then end up with an unlabeled numpy array, though, so you would have to re-add the dimensions and coordinates.
The easiest way to do so is the copy method using the data argument. It will create an array with the same structure as the original one (same dims and coords) but with different data:
# Compute diffs with numpy and write the results back into a DataArray
da.copy(data=np.diff(da, axis=da.get_axis_num("z"), prepend=0))
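For reference, a quick check (using the dummy DataArray from above) that both approaches give the same result, with the first "difference" equal to the original area at the first depth:
concat_version = xr.concat([da.isel(z=0), da.diff(dim="z", label="upper")], dim="z")
numpy_version = da.copy(data=np.diff(da, axis=da.get_axis_num("z"), prepend=0))
# compare values after putting both arrays in the same dimension order
print(np.allclose(concat_version.transpose("time", "z").values, numpy_version.values))
# True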

Related

Computing grid computations using numpy meshgrid

I have used numpy meshgrids for a long time, and typically find no issues when trying to pass that meshgrid through a function. In my experience it has always been the case that I can define my coordinate space as
x,y,z = numpy.meshgrid(numpy.linspace(-10,10,10),
numpy.linspace(-10,10,10),
numpy.linspace(-10,10,10))
and then can easily compute something like
u,v,w = numpy.sin(x*y)+numpy.cos(z).
My issue has arisen from the need to do a cross product in that calculation. I am defining a field using the meshgrid, and trying to pass the entire meshgrid through the function:
field_equation = lambda x,y,z: sum([parameter*np.cross([wire_x[i],wire_y[i],wire_z[i]],[x,y,z]) for i in range(len(wire))])
Depending on how I try to solve the problem, I get a whole host of problems. The code works fine when passing individual points (x,y,z) through one at a time, but cannot calculate for the entire field. How do I get around this?
np.cross only accepts a vector of size 3, or an nd-array whose last dimension has size 3, so we first need to stack the coordinates with np.stack([x, y, z], axis=-1) to create a 10*10*10*3 nd-array.
The result is again a 10*10*10*3 array; to be able to unpack it later we need the component axis in front, i.e. size 3*10*10*10, so I move the last axis of the resulting array to the front at the end.
In the code below, I also take the liberty to shorten the code with respect to wire a little, assuming wire_x, wire_y, wire_z are just the 3 components of wire.
import numpy as np
# test data
x, y, z = np.meshgrid(np.linspace(-10, 10, 10),
                      np.linspace(-10, 10, 10),
                      np.linspace(-10, 10, 10))
wire = [[1, 2, 3, 4], [5, 6, 7, 8], [3, 4, 5, 6]]
parameter = 1
# stack the grid into a (10, 10, 10, 3) array, cross it with each wire vector,
# then move the component axis to the front so the result can be unpacked
field_equation = lambda x, y, z: np.moveaxis(
    sum(parameter * np.cross(w, np.stack([x, y, z], axis=-1)) for w in zip(*wire)),
    -1, 0)
a,b,c = field_equation(x,y,z)
print(a.shape, b.shape, c.shape)
#(10, 10, 10) (10, 10, 10) (10, 10, 10)
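As a quick sanity check (not part of the original answer), the vectorized result at a single grid point can be compared against computing the cross products for that point by hand:
# pick an arbitrary grid point and recompute the field there directly
i, j, k = 2, 3, 4
point = sum(parameter * np.cross(w, [x[i, j, k], y[i, j, k], z[i, j, k]])
            for w in zip(*wire))
print(np.allclose(point, [a[i, j, k], b[i, j, k], c[i, j, k]]))
# True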

construct a large dask-backed xarray from an iterator of row vectors

How can I build an xarray from an iterator of row vectors?
The resulting array may be larger than memory and will be backed by a dask array.
The row vectors also come with unique labels which need to become the row index of the resulting xarray.
In the docs I only see a constructor that takes an in memory numpy array to begin with.
An example use case would be to store a word embedding model as an xarray with words as row labels. These models usually provide an iterator that produces (string, vector) pairs over all words in the vocabulary. Most models have vectors in the 100s of dimensions, and there are usually ~10^6 words in the vocabulary. I would like to stack the vectors into a matrix in order to perform linear algebra operations and also be able to look up rows by the word string.
I would expect to be able to write something like:
import numpy as np
import xarray as xr
vectors = (('V'+str(i), np.random.randn(10000)) for i in range(10**9))
xray = xarray_from_iter(vectors)
xray.to_parquet('big_xarray.parquet')
row1234567 = xray['V1234567']
Does xarray provide something like xarray_from_iter?
If not how do I write it?
xarray_from_iter should work something like numpy.fromiter, except that it should also label the rows as it goes. It would also need to delay the computation until dump was called, since the whole issue is that the array is larger than memory.
TLDR; xarray does not have a from iterator constructor. You'll have to build your dask arrays yourself.
Also, xarray does not have a to_parquet method so that is not an operation you can do (at the moment).
Here is an example of how you might construct a dask array (and xarray.DataArray) for your use case:
import dask.array
import xarray as xr
import numpy as np
num = 10
names = []
arrays = []
for i in range(num):
    names.append('V' + str(i))
    arrays.append(dask.array.random.random(10000, chunks=(1000,)))
# stack the per-row dask arrays into a single (num, 10000) dask array
data = dask.array.stack(arrays)
da = xr.DataArray(data, dims=('model', 'sample'), coords={'model': names})
print(da)
Yielding:
<xarray.DataArray 'stack-ff07239b7ea24834ba59f2d05b7f41e2' (model: 10,
sample: 10000)>
dask.array<shape=(10, 10000), dtype=float64, chunksize=(1, 1000)>
Coordinates:
* model (model) <U2 'V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' 'V9'
Dimensions without coordinates: sample
This is not likely to be efficient, especially when the length of the iterator gets large (like in your example). It may be worth proposing such a constructor on the dask github issues page.
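As a usage note (not in the original answer), once the model coordinate is attached you can look up a row by its word label, which covers the lookup part of the question:
row = da.sel(model='V3')   # lazily selects one 10000-element row by label
print(row.shape)
# (10000,)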

np.bincount for 1 line, vectorized multidimensional averaging

I am trying to vectorize an operation using numpy. I use it in a python script that I have profiled, and this operation is the bottleneck, so it needs to be optimized since I will run it many times.
The operation is on a data set of two parts. First, a large set (n) of 1D vectors of different lengths (with maximum length Lmax) whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples, Lmax), with trailing elements in each row zeroed. The second part is a set of scalar floats, one associated with each vector, that I have computed and which depends on the vector's length and the integer value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value,position along length,length).
This entire operation can be vectorized in matlab using the accumarray function, with 3 2D arrays of the same size as data whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y = num_samples;
sz_len = Lmax
sz_pos = Lmax
sz_val = maxvalue
ind_len = repmat( 1:sz_len ,1 ,sz_samples);
ind_pos = repmat( 1:sz_pos ,sz_samples,1 );
ind_val = data
ind_Y = repmat((1:sz_Y)',1 ,Lmax );
copiedY=Y(ind_Y);
mask = data>0;
finalarr=accumarray({ind_val(mask),ind_pos(mask),ind_len(mask)},copiedY(mask), [sz_val sz_pos sz_len])/sz_val;
I was hoping to emulate this implementation with np.bincount. However, np.bincount differs from accumarray in two relevant ways:
both arguments must be of same 1D size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask), ind_pos(mask), ind_len(mask)}, is a cell array of index vectors used as index tuples, while np.bincount takes a single 1D array of integer indices, as far as I understand. I expect np.ravel may be useful but am not sure how to use it here to do what I want. I am coming to python from matlab and some things do not translate directly, e.g. MATLAB's colon operator ravels in the opposite (column-major) order to np.ravel. So my question is: how might I use np.bincount, or any other numpy method, to achieve an efficient python implementation of this operation?
EDIT: To avoid wasting time: for these multi-dimensional index problems with complicated index manipulation, is the recommended route to just use cython to implement the loops explicitly?
EDIT2: Alternative Python implementation I just came up with.
Here is a heavy ram solution:
First precalculate:
Using index units for length (i.e., length 1 corresponds to index 0), make a 4D bool array of size (num_samples, Lmax+1, Lmax+1, maxvalue+1) holding where the conditions are satisfied for each value in Y.
ALLcond = np.zeros((num_samples, Lmax+1, Lmax+1, maxvalue+1), dtype='bool')
for l in range(Lmax+1):
    for i in range(Lmax+1):
        for v in range(maxvalue+1):
            ALLcond[:, l, i, v] = (data[:, i] == v) & (Lvec == l)
Where Lvec=[len(row) for row in data]. Then get the indices for these using np.where and initialize a 4D float array into which you will assign the values of Y:
[ind_Y, ind_len, ind_pos, ind_val] = np.where(ALLcond)
Yval=np.zeros(np.shape(ALLcond),dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y,ind_len,ind_pos,ind_val]=Y[ind_Y]
Y_avg = Yval.sum(axis=0) / num_samples
This gives a factor of 4 or so speed-up over the direct loop implementation. I was expecting more. Perhaps this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This is assuming data has shape (n, Lmax), Lvec is Numpy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
mask = posvec <= Lvec[:,None]  # fill-value independent
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
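For readers new to this pattern, here is a tiny self-contained illustration (with made-up numbers, unrelated to the variables above) of how ravel_multi_index plus bincount emulates accumarray-style accumulation:
import numpy as np
cat = np.array([0, 1, 1, 0])      # first index for each sample
pos = np.array([0, 2, 2, 1])      # second index for each sample
y = np.array([1.0, 2.0, 3.0, 4.0])
shape = (2, 3)
lin = np.ravel_multi_index((cat, pos), shape)            # [0, 5, 5, 1]
acc = np.bincount(lin, weights=y, minlength=np.prod(shape)).reshape(shape)
print(acc)
# [[1. 4. 0.]
#  [0. 0. 5.]]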

Fill in a numpy array without creating list

I would like to create a numpy array without creating a list first.
At the moment I've got this:
import pandas as pd
import numpy as np
dfa = pd.read_csv('csva.csv')
dfb = pd.read_csv('csvb.csv')
pa = np.array(dfa['location'])
pb = np.array(dfb['location'])
ra = [(pa[i+1] - pa[i]) / float(pa[i]) for i in range(9999)]
rb = [(pb[i+1] - pb[i]) / float(pb[i]) for i in range(9999)]
ra = np.array(ra)
rb = np.array(rb)
Is there an elegant way to do this last fill-in of the np arrays in one step, without creating the lists first?
Thanks
You can calculate with whole vectors in numpy, without the need for lists:
ra = (pa[1:] - pa[:-1]) / pa[:-1]
rb = (pb[1:] - pb[:-1]) / pb[:-1]
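As a side note (not part of the original answer), np.diff computes the same numerator, so an equivalent spelling is, assuming pa is a 1-D numpy array (the sample numbers below are made up):
import numpy as np
pa = np.array([100.0, 110.0, 99.0, 105.0])   # made-up sample data
ra = np.diff(pa) / pa[:-1]                   # same as (pa[1:] - pa[:-1]) / pa[:-1]
print(np.allclose(ra, (pa[1:] - pa[:-1]) / pa[:-1]))
# True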
The title of your question and what you need to do in your specific case are actually two slightly different things.
To create a numpy array without "casting" a list (or other iterable), you can use one of the several methods defined by numpy itself that return arrays:
np.empty, np.zeros, np.ones, np.full to create arrays of given size with fixed values
np.random.* (where * can be various distributions, like normal, uniform, exponential ...), to create arrays of given size with random values
In general, read this: Array creation routines
In your case, you already have numpy arrays (pa and pb) and you don't have to create lists to calculate the new arrays (ra and rb); you can operate directly on the numpy arrays (which is the entire point of numpy: operations on whole arrays are way faster than iterating over each element!). Copied from @Daniel's answer:
ra = (pa[1:] - pa[:-1]) / pa[:-1]
rb = (pb[1:] - pb[:-1]) / pb[:-1]
This will be much faster than your current implementation, not only because you avoid converting a list to an ndarray, but because numpy arrays are orders of magnitude faster than iteration for mathematical and batch operations.
numpy.zeros
Return a new array of given shape and type, filled with zeros.
or
numpy.ones
Return a new array of given shape and type, filled with ones.
or
numpy.empty
Return a new array of given shape and type, without initializing entries.
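For completeness, a minimal illustration of these constructors (shapes, dtypes, and fill values chosen arbitrarily):
import numpy as np
z = np.zeros((2, 3))          # 2x3 array of 0.0
o = np.ones(5, dtype=int)     # five integer 1s
e = np.empty(4)               # uninitialized; contents are arbitrary
f = np.full((2, 2), 7.5)      # 2x2 array filled with 7.5
print(z.shape, o, e.shape, f[0, 0])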

Best practice for fancy indexing a numpy array along multiple axes

I'm trying to optimize an algorithm to reduce memory usage, and I've identified this particular operation as a pain point.
I have a symmetric matrix, an index array along the rows, and another index array along the columns (which is just all values that I wasn't selecting in the row index). I feel like I should just be able to pass in both indexes at the same time, but I find myself being forced to select along one axis and then the other, which is causing some memory issues because I don't actually need the copy of the array that's returned, just statistics I'm calculating from it. Here's what I am trying to do:
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import numpy as np
iris = datasets.load_iris().data
dx = pdist(iris)
mat = squareform(dx)
outliers = [41,62,106,108,109,134,135]
inliers = np.setdiff1d( range(iris.shape[0]), outliers)
# What I want to be able to do:
scores = mat[inliers, outliers].min(axis=0)
Here's what I'm actually doing to make this work:
# What I'm being forced to do:
s1 = mat[:,outliers]
scores = s1[inliers,:].min(axis=0)
Because I'm fancy indexing, s1 is a new array instead of a view. I only need this array for one operation, so if I could eliminate returning a copy here or at least make the new array smaller (i.e. by respecting the second fancy index selection while I'm doing the first one instead of two separate fancy index operations) that would be preferable.
"Broadcasting" applies to indexing. You could convert inliers into column matrix (e.g. inliers.reshape(-1,1) or inliers[:, np.newaxis], so it has shape (m,1)) and index mat with that in the first column:
s1 = mat[inliers.reshape(-1,1), outliers]
scores = s1.min(axis=0)
There's a better way in terms of readability:
result = mat[np.ix_(inliers, outliers)].min(0)
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ix_.html#numpy.ix_
Try:
outliers = np.array(outliers) # just to be sure they are arrays
result = mat[inliers[:, np.newaxis], outliers[np.newaxis, :]].min(0)
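For completeness (not part of the original answers), a quick check with mat, inliers, and outliers from the question that all three spellings select the same sub-matrix:
a = mat[inliers.reshape(-1, 1), outliers]
b = mat[np.ix_(inliers, outliers)]
c = mat[inliers[:, np.newaxis], np.asarray(outliers)[np.newaxis, :]]
print(np.array_equal(a, b) and np.array_equal(b, c))
# True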
