Reducing memory usage when indexing in Numpy

I have a large NumPy matrix act with dtype=np.float32 and two vectors of the same length, raw_id and raw_label. I want to sort all three objects based on the values in raw_id. However, I get a memory error when running this script. I've isolated it to act[sortind,:] in the function below. How can I optimize the memory usage?
The array act is roughly 1400000 x 400, whereas raw_id and raw_label are each 1400000 x 1 with dtype=np.float64. Everything almost fits into my 12 GB of memory along with the remaining variables that I have initialised.
def sort_by_id(act, raw_id, raw_label):
    sortind = np.argsort(raw_id)
    return act[sortind,:], raw_id[sortind], raw_label[sortind]
# calling function with same variables
act, raw_id, raw_label = sort_by_id(act, raw_id, raw_label)
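One possible direction, offered as a sketch rather than a definitive answer: act[sortind,:] builds a complete second copy of act (about 1400000 x 400 x 4 bytes, roughly 2.2 GB) while the original is still alive. Reordering the rows in place, one block of columns at a time, keeps the temporary down to a single block. The helper name sort_rows_in_place and the block_cols parameter are made up for this illustration, and the small random arrays only stand in for the real data.

```python
import numpy as np

def sort_rows_in_place(act, sortind, block_cols=16):
    """Reorder the rows of `act` by `sortind`, one block of columns at a time."""
    for start in range(0, act.shape[1], block_cols):
        stop = start + block_cols
        # The fancy-indexed right-hand side is a temporary of shape
        # (n_rows, block_cols); it is fully built before being written back.
        act[:, start:stop] = act[sortind, start:stop]
    return act

# Small stand-in data (hypothetical; the real act is about 1400000 x 400).
rng = np.random.default_rng(0)
act = rng.random((1000, 40), dtype=np.float32)
raw_id = rng.permutation(1000).astype(np.float64)
raw_label = rng.integers(0, 5, size=1000).astype(np.float64)

sortind = np.argsort(raw_id)
expected = act[sortind, :]  # full copy, only affordable on this small demo
sort_rows_in_place(act, sortind)
raw_id, raw_label = raw_id[sortind], raw_label[sortind]
```

Column slices of a C-ordered array are strided, so this is likely slower than a single fancy-indexing call; the trade is speed for a much smaller peak memory footprint.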

Related

Finite difference using xarray rolling

My goal is to compute a derivative over a moving window of a multidimensional dataset along a given dimension, where the dataset is stored as an Xarray DataArray or Dataset.
In the simplest case, given a 2D array I would like to compute a moving difference across multiple entries in one dimension, e.g.:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
T = 3
reducedArray = np.zeros_like(data)
for i in range(data.shape[1]):
    if i < T:
        reducedArray[:,i] = data[:,i] - data[:,0]
    else:
        reducedArray[:,i] = data[:,i] - data[:,i-T]
where the if i < T condition ensures that input and output contain proper values (i.e., no nans) and are of identical shape.
Xarray's diff aims to perform a finite-difference approximation of a given derivative order using nearest-neighbours, so it is not suitable here, hence the question:
Is it possible to perform this operation using Xarray functions only?
The rolling weighted average example appears to be something similar, but still too distinct due to the usage of NumPy routines. I've been thinking that something along the lines of the following should work:
xr2DDataArray = xr.DataArray(
    data,
    dims=('x','y'),
    coords={'x':np.linspace(0,1,10), 'y':np.linspace(1,4,6)}
)
r = xr2DDataArray.rolling(x=T,min_periods=2)
r.reduce( redFn )
I am struggling with the definition of redFn here, though.
Caveat: The actual dataset to which the operation is to be applied will have a size of ~10 GiB, so a solution that does not blow up the memory requirements will be highly appreciated!
Update/Solution
Using Xarray rolling
After sleeping on it and a bit more fiddling, it turns out that the post linked above actually contains a solution. To obtain a finite difference we just have to define the weights to be $\pm 1$ at the ends and $0$ elsewhere:
def fdMovingWindow(data, **kwargs):
    # T is the window length; pop it so it is not passed on to np.sum
    T = kwargs.pop('T')
    weights = np.zeros(T)
    weights[0] = -1
    weights[-1] = 1
    axis = kwargs['axis']
    if data.shape[axis] == T:
        return np.sum(data * weights, **kwargs)
    else:
        return 0
r.reduce(fdMovingWindow, T=4)
Alternatively, using construct and a dot product:
weights = np.zeros(T)
weights[0] = -1
weights[-1] = 1
xrWeights = xr.DataArray(weights, dims=['window'])
xr2DDataArray.rolling(y=T,min_periods=1).construct('window').dot(xrWeights)
This carries a massive caveat: the procedure essentially creates a list of arrays representing the moving window. This is fine for a modest 2D / 3D array, but for a 4D array that takes up ~10 GiB in memory this will lead to an OOM death!
Simplistic - memory efficient
A less memory-intensive way is to copy the array and work on it in a way similar to NumPy arrays:
xrDiffArray = xr2DDataArray.copy()
dy = xr2DDataArray.y.values[1] - xr2DDataArray.y.values[0] #equidistant sampling
for src in xr2DDataArray:
    if src.y.values < xr2DDataArray.y.values[0] + T*dy:
        xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.values[0]
    else:
        xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.sel(y = src.y.values - dy*T).values
This will produce the intended result without dimensional errors, but it requires a copy of the dataset.
I was hoping to utilise Xarray to prevent a copy and instead just chain operations that are then evaluated if and when values are actually requested.
A suggestion as to how to accomplish this will still be welcomed!

I have never used xarray, so maybe I am mistaken, but I think you can get the result you want without loops and conditionals. This is at least twice as fast as your example for NumPy arrays:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
reducedArray = np.empty_like(data)
reducedArray[:, T:] = data[:, T:] - data[:, :-T]
reducedArray[:, :T] = data[:, :T] - data[:, 0, np.newaxis]
I imagine the improvement will be higher when using DataArrays.
It uses neither xarray functions nor NumPy functions, only slicing and arithmetic. I am confident that translating this to xarray will be straightforward: I know that it works if there are no coords, but once you include them you get an error because of the coords mismatch (the coords of data[:, T:] and of data[:, :-T] are different). Sadly, I can't do better for now.
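As a quick sanity check on the slicing idea, here is a small NumPy-only sketch that compares it against the loop from the question. It swaps the plain assignments for np.subtract(..., out=...), which is my own variation (not part of the answer above) so that each difference is written straight into the output without an extra temporary.

```python
import numpy as np

T = 3
data = np.kron(np.linspace(0, 1, 10), np.linspace(1, 4, 6)).reshape(10, 6)

# Reference: the loop from the question, column by column.
looped = np.zeros_like(data)
for i in range(data.shape[1]):
    ref = 0 if i < T else i - T
    looped[:, i] = data[:, i] - data[:, ref]

# Sliced version; results are written directly into `sliced`.
sliced = np.empty_like(data)
np.subtract(data[:, T:], data[:, :-T], out=sliced[:, T:])
np.subtract(data[:, :T], data[:, :1], out=sliced[:, :T])  # first column broadcasts
```

This still needs one output array of the same size as the input, so it does not avoid the copy the question was hoping to avoid.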

resample and groupby on big dask array with xarray - using map_blocks?

I have a custom workflow that requires using resample to get to a higher temporal frequency, applying a ufunc, and then groupby + mean to compute the final result.
I would like to apply this to a big xarray dataset, which is backed by a chunked dask array. For computation, I'd like to use dask.distributed.
However, when I apply this to the full dataset, the number of tasks skyrockets, overwhelming the client and most likely also the scheduler and workers if submitted.
The xarray docs explain:
Do your spatial and temporal indexing (e.g. .sel() or .isel()) early
in the pipeline, especially before calling resample() or groupby().
Grouping and resampling triggers some computation on all the blocks,
which in theory should commute with indexing, but this optimization
hasn’t been implemented in dask yet.
But I really need to apply this to the full temporal axis.
So how to best implement this?
My approach was to use map_blocks, to apply this function for each chunk individually as to keep the individual xarray sub-datasets small enough.
This seems to work on a small scale, but when I use the full dataset, the workers run out of memory and quickly die.
Looking at the dashboard, the function I'm applying to the array gets executed multiple times of the number of chunks I have. Shouldn't these two numbers line up?
So my questions are:
Is this approach valid?
How could I implement this workflow otherwise, besides manually implementing the resample and groupby part and putting it in a ufunc?
Any ideas regarding the performance issues at scale (specifically the number of executions vs chunks)?
Here's a small example that mimics the workflow and shows the number of executions vs chunks:
from time import sleep
import dask.array  # makes dask.array.zeros available below
from dask.distributed import Client, LocalCluster
import numpy as np
import pandas as pd
import xarray as xr

def ufunc(x):
    # computation
    sleep(2)
    return x

def fun(x):
    # upsample to higher res
    x = x.resample(time="1h").asfreq().fillna(0)
    # apply function
    x = xr.apply_ufunc(ufunc, x, input_core_dims=[["time"]], output_core_dims=[['time']], dask="parallelized")
    # average over dates
    x['time'] = x.time.dt.strftime("%Y-%m-%d")
    x = x.groupby("time").mean()
    return x

def create_xrds(shape):
    ''' helper function to create dataset'''
    x,y,t = shape
    tv = pd.date_range(start="1970-01-01", periods=t)
    ds = xr.Dataset({
        "band": xr.DataArray(
            dask.array.zeros(shape, dtype="int16"),
            dims=['x', 'y', 'time'],
            coords={"x": np.arange(0, x), "y": np.arange(0, y), "time": tv})
    })
    return ds

# set up distributed
cluster = LocalCluster(n_workers=2)
client = Client(cluster)
ds = create_xrds((500,500,500)).chunk({"x": 100, "y": 100, "time": -1})
# create template
template = ds.copy()
template['time'] = template.time.dt.strftime("%Y-%m-%d")
# map fun to blocks
ds_out = xr.map_blocks(fun, ds, template=template)
# persist
ds_out.persist()
Using the example above, the dask array has 25 chunks, but the function fun gets executed 125 times.

Looking at the dashboard, the function I'm applying to the array gets executed multiple times of the number of chunks I have. Shouldn't these two numbers line up?
This is misleading because of an unfortunate choice made when building the graph. The number includes tasks that create a block of the input Dataset (one per variable per chunk) and of the output Dataset, as well as the tasks that apply the function. This will get fixed soon (https://github.com/pydata/xarray/pull/5007).

Applying a simple function to CSV and save multiple csv files

I am trying to replicate the data by multiplying every value by a random factor drawn from a range, and to save the results as CSV.
I have created a function "Replicate_Data" which takes the input NumPy array and multiplies it by a random value from that range. What is the best way to create 100 files and save them as P3D1, P4D1 and so on?
def Replicate_Data(data: np.ndarray) -> np.ndarray:
    Rep_factor = random.uniform(-3, 7)
    data1 = data * Rep_factor
    return data1
P2D1 = Replicate_Data(P1D1)
# np.savetxt takes no dtype argument; complex arrays are written as they are
np.savetxt("P2D1.csv", P2D1, delimiter=",")

Here is an example you can use as reference.
I generate toy data named toy, then I make n random values using np.random.uniform and call it randos, then I multiply these two objects to form out using numpy broadcasting. You could also do this multiplication in a loop (the same one you save in, in fact): depending on the size of your input array it could be very memory intensive as I've written it. A more complete answer probably depends on the shape of your input data.
import numpy as np
toy = np.random.random(size=(2,2)) # a toy input array
n = 100 # number of random values
randos = np.random.uniform(-3,7,size=n) # generate 100 uniform randoms
# now multiply all elements in toy by the randoms in randos
out = toy[None,...]*randos[...,None,None] # this depends on the shape.
# this will work only if toy has two dimensions. Otherwise requires modification
# it will take a lot of memory... 100*toy.nbytes worth
# now save in the loop..
for i,o in enumerate(out):
    name = 'P{}D1'.format(str(i+1))
    np.savetxt(name,o,delimiter=",")
# a second way without the broadcasting (slow, better on memory)
# more like 2*toy.nbytes
#for i,r in enumerate(randos):
#    name = 'P{}D1'.format(str(i+1))
#    np.savetxt(name,r*toy,delimiter=",")
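For completeness, here is a minimal sketch of the memory-friendly loop that also produces the file names from the question (P2D1.csv, P3D1.csv, ...). The stand-in input array and the temporary output folder are assumptions made only so the example runs on its own; substitute the real P1D1 and a real directory.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng()
P1D1 = rng.random((4, 3))     # hypothetical stand-in for the real input array
out_dir = tempfile.mkdtemp()  # hypothetical output folder

written = []
for k in range(2, 102):       # P2D1 ... P101D1, i.e. 100 files
    factor = rng.uniform(-3, 7)  # one random factor per file
    path = os.path.join(out_dir, f"P{k}D1.csv")
    np.savetxt(path, P1D1 * factor, delimiter=",")
    written.append(path)
```

Only one scaled copy of the input exists at any time, so the peak memory is about twice the size of the input, at the cost of doing the multiplication inside the Python loop.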

Bus Error : Core dumped when using numpy memmap

I'm trying to perform calculations on very large arrays, with dimensions 65536 x 65536. I read that np.memmap allows calculations to be performed out of core, so that I'm not limited by memory, and I tried to do this. This method works for small arrays like 8192 x 8192. However, when I try the larger dimension, I get a bus error (core dumped). What could be causing this issue, and how can I overcome it? I would appreciate any advice. The code is below. I have 2 arrays X and Y, stored in binary format, which I load and perform calculations on. Additionally, I have 128 GB of RAM, so there should be no issue with these new arrays being allocated.
import numpy as np
X = np.memmap('X.bin',dtype='float64',mode='r',shape=(65536,65536))
Y = np.memmap('Y.bin',dtype='float64',mode='r',shape=(65536,65536))
A = np.fft.rfft2(np.fft.fftshift(X))
B = np.fft.rfft2(Y)
C = np.fft.irfft2(A*B)
alpha_x,alpha_y = np.gradient(C,edge_order=2)
Together, this would be
alpha_x,alpha_y = np.gradient(np.fft.irfft2(np.fft.rfft2(Y)*np.fft.rfft2(np.fft.fftshift(X))),edge_order=2)
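It may help to put numbers on the arrays involved. The sketch below is only bookkeeping, not a diagnosis of the bus error: it assumes that rfft2 of a float64 input returns a complex128 array of shape (n, n//2 + 1), that irfft2 and np.gradient return ordinary in-memory float64 arrays, and it ignores any internal temporaries.

```python
n = 65536
GiB = 2**30

real_bytes = n * n * 8              # float64 array of shape (n, n)
half_bytes = n * (n // 2 + 1) * 16  # complex128 array of shape (n, n//2 + 1)

# Named results that are all alive once np.gradient has returned
# in the multi-line version of the script.
alive = {
    "A": half_bytes,
    "B": half_bytes,
    "C": real_bytes,
    "alpha_x": real_bytes,
    "alpha_y": real_bytes,
}
for name, size in alive.items():
    print(f"{name:>8}: {size / GiB:6.2f} GiB")

alive_gib = sum(alive.values()) / GiB
print(f"   total: {alive_gib:6.2f} GiB")
```

Under these assumptions the named results alone add up to more than the 128 GB mentioned above, before counting the copy made by fftshift or the temporary A*B, so memory may be tighter than it first appears.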

Improving moving-window computation in memory consumption and speed

Is it possible to obtain better performance (both in memory consumption and speed) in this moving-window computation? I have a 1000x1000 numpy array and I take 16x16 windows through the whole array and finally apply some function to each window (in this case, a discrete cosine transform.)
import numpy as np
from scipy.fftpack import dct
from skimage.util import view_as_windows
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
window_size = 16
windows = view_as_windows(X, (window_size,window_size))
dcts = np.zeros(windows.reshape(-1,window_size, window_size).shape, dtype=np.float32)
for idx, window in enumerate(windows.reshape(-1,window_size, window_size)):
    dcts[idx, :, :] = dct(window)
dcts = dcts.reshape(windows.shape)
This code takes too much memory (in the example above, the memory consumption is not so bad: windows uses 1 GB and dcts also needs 1 GB) and takes 25 seconds to complete. I'm a bit unsure as to what I'm doing wrong, because this should be a straightforward calculation (e.g. filtering an image). Is there a better way to accomplish this?
UPDATE:
I was initially worried that the arrays produced by Kington's solution and my initial approach were very different, but the difference is restricted to the boundaries, so it is unlikely to cause serious issues for most applications. The only remaining problem is that both solutions are very slow. Currently, the first solution takes 1min 10s and the second solution 59 seconds.
UPDATE 2:
I noticed the biggest culprits by far are dct and np.mean. Even generic_filter performs decently (8.6 seconds) using a "cythonized" version of mean with bottleneck:
import bottleneck as bp
import scipy.ndimage
def func(window, shape):
    window = window.reshape(shape)
    #return np.abs(dct(dct(window, axis=1), axis=0)).mean()
    return bp.nanmean(dct(window))
result = scipy.ndimage.generic_filter(X, func, (16, 16),
                                      extra_arguments=([16, 16],))
I'm currently reading how to wrap C code using numpy in order to replace scipy.fftpack.dct. If anyone knows how to do it, I would appreciate the help.

Since scipy.fftpack.dct calculates separate transforms along the last axis of the input array, you can replace your loop with:
windows = view_as_windows(X, (window_size,window_size))
dcts = dct(windows)
result1 = dcts.mean(axis=(2,3))
Now only the dcts array requires a lot of memory, and windows remains merely a view into X. Because the DCTs are calculated with a single function call, it is also much faster. However, because the windows overlap, there are lots of repeated calculations. This can be overcome by calculating the DCT for each sub-row only once, followed by a windowed mean:
ws = window_size
row_dcts = dct(view_as_windows(X, (1, ws)))
cs = row_dcts.squeeze().sum(axis=-1).cumsum(axis=0)
result2 = np.vstack((cs[ws-1], cs[ws:]-cs[:-ws])) / ws**2
Though it seems what is gained in efficiency is lost in code clarity... Basically, the approach here is to first calculate the DCTs and then take the window average by summing over the 2D window and dividing by the number of elements in the window. The DCTs are already calculated over row-wise moving windows, so we take a regular sum over those windows. However, we need to take a moving-window sum over the columns to arrive at the proper 2D window sums. To do this efficiently we use a cumsum trick, where:
sum(A[p:q]) # q-p == window_size
is equivalent to:
cs = cumsum(A)
cs[q-1] - cs[p-1]
This avoids having to sum the exact same numbers over and over. Unfortunately it doesn't work for the first window (when p == 0), so for that we have to take only cs[q-1] and stack it together with the other window sums. Finally we divide by the number of elements to arrive at the 2D window average.
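The cumsum trick is easier to see on a plain 1D array. The following is a small illustrative sketch (the array A and the window size ws = 4 are arbitrary choices for the demo) that builds all moving-window sums from one cumulative sum and compares them with summing each window directly.

```python
import numpy as np

A = np.arange(10, dtype=np.float64)
ws = 4  # window size

cs = np.cumsum(A)
# First window (p == 0) is cs[ws-1]; every later window is cs[q-1] - cs[p-1].
window_sums = np.concatenate(([cs[ws - 1]], cs[ws:] - cs[:-ws]))

# Direct (repeated) summation, for comparison only.
direct = np.array([A[p:p + ws].sum() for p in range(len(A) - ws + 1)])
```

Each window sum then costs one subtraction instead of ws additions, which is where the saving comes from.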
If you would like to do a 2D DCT, then this second approach becomes less interesting, because you'll eventually need the full 985 x 985 x 16 x 16 array before you can take the mean.
Both approaches above should be equivalent, but it may be a good idea to perform the arithmetic with 64-bit floats:
np.allclose(result1, result2, atol=1e-6)
# False
np.allclose(result1, result2, atol=1e-5)
# True

skimage.util.view_as_windows is using striding tricks to make an array of overlapping "windows" that doesn't use any additional memory.
However, when you make a new array with the shape of windows, it will require ~256 times (16 x 16) the memory that your original X array or the windows view used.
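To make the view-versus-copy difference concrete, here is a sketch using NumPy's own sliding_window_view in place of skimage.util.view_as_windows. That substitution is my assumption for a dependency-free demo (it needs NumPy 1.20 or newer); for this case both produce a strided view of shape (985, 985, 16, 16).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(1000 * 1000, dtype=np.float32).reshape(1000, 1000)
windows = sliding_window_view(X, (16, 16))  # strided view, no data copied

is_view = np.may_share_memory(windows, X)

# Bytes that a materialised array of the same shape would need.
copy_bytes = windows.size * windows.itemsize
ratio = copy_bytes / X.nbytes

print(windows.shape, is_view, f"{copy_bytes / 1e9:.2f} GB", f"{ratio:.0f}x")
```

The ratio comes out slightly below 256 because only 985 x 985 window positions fit into a 1000 x 1000 array.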
Based on your comment, your end result is doing dcts.reshape(windows.shape).mean(axis=2).mean(axis=2) - taking the mean of the dct of each window.
Therefore, it would be more memory-efficient (though similar performance wise) to take the mean inside the loop and not store the huge intermediate array of windows:
import numpy as np
from scipy.fftpack import dct
from skimage.util import view_as_windows
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
window_size = 16
windows = view_as_windows(X, (window_size, window_size))
dcts = np.zeros(windows.shape[:2], dtype=np.float32).ravel()
for idx, window in enumerate(windows.reshape(-1, window_size, window_size)):
    dcts[idx] = dct(window).mean()
dcts = dcts.reshape(windows.shape[:2])
Another option is scipy.ndimage.generic_filter. It won't increase performance much (the bottleneck is the python function call in the inner loop), but you'll have a lot more boundary condition options, and it will be fairly memory efficient:
import numpy as np
from scipy.fftpack import dct
import scipy.ndimage
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
def func(window, shape):
    window = window.reshape(shape)
    return dct(window).mean()
result = scipy.ndimage.generic_filter(X, func, (16, 16),
                                      extra_arguments=([16, 16],))
