I have an application where I am loading raster data into a Dask array and then I only need to process the chunks which overlap with some region of interest. I know that I can create a Dask masked array, but I am looking for a way to prevent certain chunks from being processed at all – as some of the ROIs contain multiple polygons which are very far apart and thus 90% of the chunks will be discarded in the end.
A simple example would be, as below, where arr2 contains no information at all, but is needed for alignment of the other chunks.
import numpy as np
import dask.array as da
arr0 = da.from_array(np.arange(1, 26).reshape(5,5), chunks=(5, 5))
arr1 = da.from_array(np.arange(25, 50).reshape(5,5), chunks=(5, 5))
arr2 = da.from_array(np.zeros((5,5)), chunks=(5, 5))
arr3 = da.from_array(np.arange(75, 100).reshape(5,5), chunks=(5, 5))
a = da.block([[arr0, arr1],[arr2, arr3]])
b = da.ma.masked_equal(a, 0)
c = da.min(b)
c.visualize()
Plotting the graph shows that arr2 is still in the computational graph. It also takes up memory, because it is still evaluated even though it is masked. What I'd like is a way to mask the entire chunk/block so that it is ignored in the computation altogether.
Related
I am trying to calculate a mean value across a large numpy array. Originally, I tried:
data = (np.ones((10**6, 133))
for _ in range(100))
np.stack(data).mean(axis=0)
but I was getting
numpy.core._exceptions.MemoryError: Unable to allocate xxx GiB for an array with shape (100, 1000000, 133) and data type float32
In the original code data is a generator of more meaningful vectors.
I thought about using dask for such an operation, hoping it will split my data into chunks backed by disk.
import dask.array as da
import numpy as np
data = (np.ones((10**6, 133)) for _ in range(100))
x = da.stack(da.from_array(arr, chunks="auto") for arr in data)
x = da.mean(x, axis=0)
y = x.compute()
However, when I run it, the process terminates with "Killed".
How can I resolve this issue on a single machine?
You can try this approach:
agg_sum = np.zeros((10**6, 133))
total = 100
for dt in data:
    agg_sum += dt  # accumulate in place; only one input array is alive at a time
_mean = agg_sum / total
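The running-sum idea above can be tried on a small scale. The sketch below is only an illustration: the helper name `streaming_mean` is hypothetical, and it assumes every array from the iterable has the same shape. It also counts the arrays as it goes, so the total does not have to be known in advance.

```python
import numpy as np

def streaming_mean(arrays):
    """Mean over an iterable of equally shaped arrays, holding one at a time."""
    total = None
    count = 0
    for arr in arrays:
        if total is None:
            # Allocate the accumulator once, from the first array's shape.
            total = np.zeros(arr.shape, dtype=np.float64)
        total += arr  # in-place add: no new full-size array per step
        count += 1
    if count == 0:
        raise ValueError("cannot take the mean of an empty iterable")
    return total / count

# Small stand-in for the generator of (10**6, 133) arrays in the question.
small = (np.full((4, 3), float(i)) for i in range(5))
result = streaming_mean(small)
print(result)
```

Peak memory is then roughly one accumulator plus one input array, regardless of how many arrays the generator yields.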
An alternative solution I found is to store all arrays in a disk-backed file, using numpy.memmap.
import numpy as np
total = 100
shape = (10 ** 6, 133)
c = np.memmap(
"total.array", dtype="float64", mode="w+", shape=(total, *shape), order="C"
)
for idx, arr in enumerate(data):
    c[idx, :, :] = arr
    del arr  # drop the reference before the generator builds the next array
c.mean(axis=0)
The important thing here is to del arr, so that memory does not fill up before the garbage collector reclaims unused arrays.
Note that this solution requires around 100 GB of disk space, while the solution of @MSS requires much less space because it keeps only the current sum.
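The disk-space figure can be checked with a little arithmetic. This sketch assumes float64 elements, as in the memmap call above, and compares the size of the full file with the size of a single running-sum array.

```python
import numpy as np

total = 100
shape = (10 ** 6, 133)
itemsize = np.dtype("float64").itemsize  # bytes per element

# Full stack of all arrays, as written to the memmap file.
memmap_bytes = total * shape[0] * shape[1] * itemsize
# A single accumulator array, as used by the running-sum approach.
running_sum_bytes = shape[0] * shape[1] * itemsize

print(f"memmap file: {memmap_bytes / 1e9:.1f} GB")
print(f"running sum: {running_sum_bytes / 1e9:.2f} GB")
```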
I have a 3D numpy array and I want to shuffle it block-wise along a particular axis while keeping the data in each block in its original state. For instance, I have an np array of shape (50, 140, 23) and I want to shuffle it by making blocks of (50, 1, 23) on axis=1. So 140 blocks will be created, and the blocks should be shuffled on axis=1 while the data inside each block keeps its original order. I read the documentation for np.random.shuffle(x), but it only shuffles along the first axis and does not accept a block size.
Is there any function in numpy or a quick way to do this?
You can use a random permutation:
A = sum(np.ogrid[0:0:50j,:140,0:0:23j])
rng = np.random.default_rng()
Ashuff = A[:,rng.permutation(140),:]
Perhaps swapping axis, shuffling and swapping back might do the trick for you?
a = np.random.random((50,140,23))
b = np.swapaxes(a, 0, 1)
np.random.shuffle(b)
c = np.swapaxes(b, 0, 1)
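One thing to keep in mind with the swapaxes version: np.swapaxes returns a view, so np.random.shuffle(b) also reorders a in place. If the original array should stay untouched, a possible alternative is the axis argument of Generator.permutation, which returns a shuffled copy. The sketch below uses a made-up input built with np.arange so that each block can be traced back to its source; the seed is chosen only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
a = np.arange(50 * 140 * 23).reshape(50, 140, 23)

# Returns a shuffled copy along axis 1; `a` itself is left unchanged.
shuffled = rng.permutation(a, axis=1)

# With this arange input, block j satisfies a[0, j, 0] == j * 23,
# so the source block of each output block can be recovered.
source = shuffled[0, :, 0] // 23

# Each (50, 1, 23) block of the result is an intact block of the input.
print(np.array_equal(shuffled, a[:, source, :]))
```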
I have 4 square arrays of the same shape
array1 = 1*np.ones((10,10))
array2 = 2*np.ones((10,10))
array3 = 3*np.ones((10,10))
array4 = 4*np.ones((10,10))
I want to recombine them into one big array in an interleaved mosaic pattern as such:
result = np.asarray([[1,2,1,2,...,1,2],\
[3,4,3,4,...,3,4],\
[1,2,1,2,...,1,2],\
...
[3,4,3,4,...,3,4]])
Where result is twice as big in both dimensions as the original individual images.
Is there an efficient way to do this?
To illustrate my question, I used arrays containing constant values but in reality, these 4 arrays would be different images.
Two common approaches for interlacing data in numpy are:
A) Assign each source to a slice of a blank result array, corresponding to where the data should go:
result = np.zeros((20, 20)) # allocate space
result[::2, ::2] = array1 # put those values in the appropriate spots
result[::2, 1::2] = array2
result[1::2, ::2] = array3
result[1::2, 1::2] = array4
B) use stacking to stick the data together in a single array, and then reshape to flatten the data in a way that leaves it interlaced. This typically requires a bit of trial and error, but after playing around with the REPL a bit I came up with:
result = np.hstack((np.dstack((array1, array2)), np.dstack((array3, array4)))).reshape(20, 20)
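Since the constant-valued arrays hide ordering mistakes, it can be worth checking the two approaches against each other on arrays with distinct values. The sketch below uses random 10x10 arrays as stand-ins for the real images.

```python
import numpy as np

rng = np.random.default_rng(0)
array1, array2, array3, array4 = (rng.random((10, 10)) for _ in range(4))

# A) strided assignment into a preallocated result
result_a = np.empty((20, 20))
result_a[::2, ::2] = array1
result_a[::2, 1::2] = array2
result_a[1::2, ::2] = array3
result_a[1::2, 1::2] = array4

# B) stack, then reshape so the values end up interlaced
result_b = np.hstack(
    (np.dstack((array1, array2)), np.dstack((array3, array4)))
).reshape(20, 20)

print(np.array_equal(result_a, result_b))
```

Approach A states the layout directly in the slices, which makes it easier to adapt if the mosaic pattern changes.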
How can I build an xarray from an iterator of row vectors?
The resulting array may be larger than memory and will be backed by a dask array.
The row vectors also come with unique labels which need to become the row index of the resulting xarray.
In the docs I only see a constructor that takes an in memory numpy array to begin with.
An example use case would be to store a word embedding model as an xarray with words as row labels. These models usually provide an iterator that produces (string, vector) pairs over all words in the vocabulary. Most models have hundreds of dimensions, and there are usually ~10^6 words in the vocabulary. I would like to stack the vectors into a matrix in order to perform linear algebra operations and also be able to look up rows by the word string.
I would expect to be able to write something like:
import numpy as np
import xarray as xr
vectors = (('V'+str(i), np.random.randn(10000)) for i in range(10**9))
xray = xarray_from_iter(vectors)
xray.to_parquet('big_xarray.parquet')
row1234567 = xray['V1234567']
Does xarray provide something like xarray_from_iter?
If not how do I write it?
xarray_from_iter should work something like numpy.fromiter
except that it should also label the rows as it goes.
It would also need to delay the computation until the dump is called,
since the whole issue is that the array is larger than memory.
TL;DR: xarray does not have a from-iterator constructor. You'll have to build your dask arrays yourself.
Also, xarray does not have a to_parquet method so that is not an operation you can do (at the moment).
Here is an example of how you might construct a dask array (and xarray.DataArray) for your use case:
import dask.array
import xarray as xr
import numpy as np
num = 10
names = []
arrays = []
for i in range(num):
    names.append('V'+str(i))
    arrays.append(dask.array.random.random(10000, chunks=(1000,)))
data = dask.array.stack(arrays)  # shape (num, 10000), one row per name
da = xr.DataArray(data, dims=('model', 'sample'), coords={'model': names})
print(da)
Yielding:
<xarray.DataArray 'stack-ff07239b7ea24834ba59f2d05b7f41e2' (model: 10,
sample: 10000)>
dask.array<shape=(10, 10000), dtype=float64, chunksize=(1, 1000)>
Coordinates:
* model (model) <U2 'V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' 'V9'
Dimensions without coordinates: sample
This is not likely to be efficient, especially when the iterator gets long (as in your example). It may be worth proposing such a constructor on the dask GitHub issues page.
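One way to reduce the per-row overhead is to group the iterator into batches before building any array, so that each chunk holds many rows instead of one. The helper below is a hypothetical sketch, not part of xarray or dask: `labelled_batches` and `batch_size` are names made up for illustration, and the code uses only numpy and itertools. Each yielded block could then be wrapped with dask.array.from_array and joined with dask.array.concatenate, or written to disk, before the next batch is read.

```python
from itertools import islice

import numpy as np

def labelled_batches(pairs, batch_size):
    """Yield (labels, block) from an iterator of (label, vector) pairs."""
    pairs = iter(pairs)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        labels = [label for label, _ in batch]
        block = np.stack([vec for _, vec in batch])  # shape (len(batch), dim)
        yield labels, block

# Small stand-in for the (string, vector) iterator in the question.
vectors = (("V" + str(i), np.full(4, float(i))) for i in range(10))

# Collected into a list here only for inspection; with real data each
# batch would be consumed before the next one is produced.
batches = list(labelled_batches(vectors, batch_size=4))
for labels, block in batches:
    print(labels, block.shape)
```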
The code is too complicated to paste here, but I have a numpy array shaped (800, 800, 1300), or 1300 matrices shaped (800, 800). This is 5GB.
I pass this array into a function, whereby the function
1. multiplies each "matrix" in the above array by a float in a (1300,) shaped array
2. sums the array into one "matrix", shaped (800, 800)
3. and takes the inverse of the matrix
This program runs at 20.2 GB RAM! Is that possible? I cannot see any memory leaks. I am simply taking numpy arrays, and passing them through a function. I then save the resulting arrays.
I'll try to post the code.
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.io
import os
data_file1 = "filename1.npy"
data_file2 = "filename2.npy"
data_file3 = "filename3.npy"
data1 = np.load(data_file1)
data2 = np.load(data_file2)
data3 = np.load(data_file3)
data_total = np.concatenate((data1, data2, data3)) # This array is shape (800,800,1300), around 6 GB.
array1 = np.arange(1300) + 1
vector = np.arange(800) + 1
def function_matrix(data_total, vector):
    Multi_matrix = array1[:, None, None] * data_total  # step 1, multiplies each (800,800) matrix
    Sum_matrix = np.sum(Multi_matrix, axis=0)  # sum matrix
    mTCm = np.array([np.dot(vector.T, np.linalg.solve(Sum_matrix, vector))])
    return mTCm
draw_pointsA = np.asarray([[function_matrix(data_total[i], vector[j]) for i in np.arange(0,100)] for j in np.arange(0,100)])
filename = "save_datapoints.npy"
np.save(filename, draw_pointsA)
EDIT 2:
See below. It is actually 12 GB RAM, 20.1 GB virtual size of process.
This doesn't answer your question, but proposes a way to avoid the problem from the start.
Step 1 is sequential -- you only need 1 matrix loaded at a time.
Change your code to process each matrix independently.
By Step 2 your memory requirement is down to 800 * 800 * sizeof(datum), which is a few megabytes, and you can certainly afford to keep that in memory.
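The two steps can be sketched as a single loop that scales one matrix at a time and adds it to an accumulator. This is a minimal illustration under assumptions: the helper name `weighted_sum` is hypothetical, and a small random stack stands in for the (1300, 800, 800) data. The last line builds the broadcast product only to check the result; it is exactly the full-size intermediate the loop avoids.

```python
import numpy as np

def weighted_sum(matrices, weights):
    """Sum of weights[k] * matrices[k], processing one matrix at a time."""
    acc = None
    for w, m in zip(weights, matrices):
        if acc is None:
            acc = np.zeros(m.shape, dtype=np.float64)
        acc += w * m  # the temporary is one matrix, not the whole stack
    return acc

rng = np.random.default_rng(0)
stack = rng.random((13, 8, 8))   # small stand-in for the real data
weights = np.arange(13) + 1.0

streamed = weighted_sum(stack, weights)
reference = np.sum(weights[:, None, None] * stack, axis=0)
print(np.allclose(streamed, reference))
```

Because `matrices` is only iterated over, it could equally be a generator that loads each matrix from disk on demand.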
It sounds like this could be a type issue, i.e. the values in the matrices were converted to a different type. Perhaps you stored the original matrix with values as int16 or single precision (float32), and after multiplying it by a float it is stored as a matrix of double values (which take twice the memory of single-precision values, or four times that of int16).
You can use the dtype argument to set the value type for the matrix.
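The effect is easy to reproduce on a small array. The sketch below assumes the data was stored as float32, which the question does not state, and uses integer weights like the array1 in the posted code. Multiplying the two promotes the result to float64, while casting the weights to float32 first keeps the result at the original size.

```python
import numpy as np

data = np.ones((4, 8, 8), dtype=np.float32)  # assumed storage type
weights = np.arange(4) + 1                   # integer weights

# Integer array times float32 array is promoted to float64.
product = weights[:, None, None] * data
print(data.dtype, data.nbytes)
print(product.dtype, product.nbytes)

# Casting the weights first keeps the product in float32.
product32 = weights.astype(np.float32)[:, None, None] * data
print(product32.dtype, product32.nbytes)
```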
Another possible reason is that additional matrices are created along the way. That is impossible to tell unless you post the code.
A possible solution to your memory problem is to use HDF5 files and write the matrices to disk. Then you could load the matrices one at a time. This is easy with h5py, as the matrices can be compressed and/or sliced using numpy/scipy syntax.