Bus Error : Core dumped when using numpy memmap - python

I'm trying to perform calculations on very large arrays of dimensions 65536 x 65536. Since I read that np.memmap allows calculations out of core, so that I'm not limited by memory, I tried this approach. It works for smaller arrays such as 8192 x 8192, but when I try the larger dimensions I get a bus error (core dumped). What could be causing this, and how can I overcome it? I would appreciate any advice. The code is below. I have two arrays, X and Y, stored in binary format, which I load and perform calculations on. Additionally, I have 128 GB of RAM, so there should be no issue allocating the resulting arrays.
import numpy as np

# Memory-map the two input arrays from disk (read-only)
X = np.memmap('X.bin', dtype='float64', mode='r', shape=(65536, 65536))
Y = np.memmap('Y.bin', dtype='float64', mode='r', shape=(65536, 65536))

# FFT-based convolution of X and Y, followed by a gradient
A = np.fft.rfft2(np.fft.fftshift(X))
B = np.fft.rfft2(Y)
C = np.fft.irfft2(A * B)
alpha_x, alpha_y = np.gradient(C, edge_order=2)
Together, this would be
alpha_x,alpha_y = np.gradient(np.fft.irfft2(np.fft.rfft2(Y)*np.fft.rfft2(np.fft.fftshift(X))),edge_order=2)
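One hedged sanity check: a memmap whose shape extends past the end of the file on disk raises SIGBUS (bus error) as soon as out-of-range pages are touched, so it is worth comparing the expected on-disk size against the actual file size before suspecting the FFT step. A minimal sketch:
import os
import numpy as np
shape, dtype = (65536, 65536), np.dtype('float64')
expected = shape[0] * shape[1] * dtype.itemsize  # 32 GiB per array
for fname in ('X.bin', 'Y.bin'):
    actual = os.path.getsize(fname)
    print(fname, actual, expected, actual >= expected)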

Related

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

I'm looking to parallelize multiple 1D FFTs using CUDA. I'm working on a GTX 1050 Ti, which has compute capability 6.1.
For instance, in the code attached below, I have a 3D input array 'data', and I want to do 1D FFTs over its second dimension. The goal, of course, is to speed up execution by an order of magnitude.
I'm able to use scikit-cuda's cufft package to run a batch of one 1D FFT, and the results match NumPy's FFT. The problem comes when I move to a real batch size: there, I can't match NumPy's FFT output (which is the correct one) with cufft's output (which I believe is incorrect). In the code attached, the parameter 'singleFFT' controls whether we schedule a batch of one or of many. Help with correcting the FFT output, and with speeding up execution further if possible, would be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
    data_t = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
    dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line of the code compares NumPy's FFT result against cuFFT's. With singleFFT=1 the comparison prints True, while for singleFFT=0 (i.e. a batch of many 1D FFTs) it prints False.
After my attempts, I would conclude the following:
Using the cufft library from skcuda is a bit tricky, and getting to the correct FFT output can take a long time in development. I also noticed that there wasn't an order-of-magnitude difference in execution time between NumPy's FFT and skcuda's cufft.
Using CuPy, and arranging your data so that the FFT dimension is laid out in contiguous memory, gives an order-of-magnitude improvement in FFT compute time. In my case the speedup was a little better than 10x (a sketch follows below).
Using CuPy for FFTs is a great option if one wants to stick to Python-based development only. The back and forth between C and Python when writing GPU kernels by hand is an added overhead that CuPy conveniently removes, even though CuPy itself lays out the plan and calls the FFT execution engine internally.
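A minimal sketch of that CuPy approach, assuming CuPy is installed; the shapes mirror the question, and the FFT axis is moved to the last, contiguous dimension before the batched 1D FFT:
import numpy as np
import cupy as cp
nSamp, nChirp, nTx, nRx = 512, 256, 16, 16
data = (np.random.randn(nSamp, nChirp, nTx*nRx)
        + 1j*np.random.randn(nSamp, nChirp, nTx*nRx)).astype(np.complex64)
# Lay the FFT dimension out contiguously: (nSamp, nTx*nRx, nChirp)
data_c = np.ascontiguousarray(np.moveaxis(data, 1, -1))
data_gpu = cp.asarray(data_c)
fft_gpu = cp.fft.fft(data_gpu, axis=-1)            # batched 1D FFTs along the last axis
result = cp.asnumpy(cp.moveaxis(fft_gpu, -1, 1))   # back to (nSamp, nChirp, nTx*nRx)
print(np.allclose(result, np.fft.fft(data, axis=1), atol=1e-5))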

fast downsampling of huge matrix using python (numpy memmap, pytables or other?)

As part of my data processing I produce huge, non-sparse matrices on the order of 100000*100000 cells, which I want to downsample by a factor of 10 to reduce the amount of data. In this case I want to average over blocks of 10*10 pixels, reducing the size of my matrix from 100000*100000 to 10000*10000.
What is the fastest way to do this in Python? It doesn't matter if I need to save my original data in a new data format, because I have to downsample the same dataset multiple times.
Currently I am using numpy.memmap:
import numpy as np

data_1 = 'data_1.dat'
data_2 = 'data_2.dat'
lines = 100000
pixels = 100000
window = 10
new_lines = lines // window
new_pixels = pixels // window
dat_1 = np.memmap(data_1, dtype='float32', mode='r', shape=(lines, pixels))
dat_2 = np.memmap(data_2, dtype='float32', mode='r', shape=(lines, pixels))
dat_in = dat_1 * dat_2
dat_out = dat_in.reshape([new_lines, window, new_pixels, window]).mean(3).mean(1)
But with large files this method becomes very slow. Likely this has something to do with the binary layout of these files, which is ordered by line. I therefore think that a data format which stores my data in blocks instead of lines would be faster, but I am not sure how large the performance gain would be and whether there are Python packages that support this (a sketch of one chunked option follows below).
I have also thought about downsampling the data before creating such a huge matrix (not shown here), but my input data is fractured and irregular, so that would become very complex.
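A minimal sketch of one chunked option, assuming h5py is available: HDF5 lets you pick a chunk shape so that square tiles, rather than full 100000-element lines, are contiguous on disk, which is the block-ordered layout speculated about above. The tile size here is an arbitrary choice.
import numpy as np
import h5py

lines, pixels = 100000, 100000
tile = 1000   # chunk edge length, a tunable assumption
with h5py.File('data_1.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(lines, pixels),
                            dtype='float32', chunks=(tile, tile))
    # Write one row-band at a time so only ~0.4 GB is in memory at once
    # (placeholder values; this writes a ~40 GB file if run in full)
    for r in range(0, lines, tile):
        dset[r:r + tile, :] = np.random.rand(tile, pixels).astype('float32')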
Based on this answer, I think this might be a relatively fast method, depending on how much overhead reshape gives you with memmap.
import numpy as np

def downSample(a, window):
    i, j = a.shape
    ir = np.arange(0, i, window)
    jr = np.arange(0, j, window)
    n = 1./(window**2)
    return n * np.add.reduceat(np.add.reduceat(a, ir), jr, axis=1)
Hard to test speed without your dataset.
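For illustration, a hedged usage sketch that applies downSample band by band to the memmapped arrays from the question (it reuses dat_1, dat_2, lines and window from the snippet above), so only a 1000-row band of each input and of their product is in memory at a time:
out = np.vstack([downSample(dat_1[r:r + 1000] * dat_2[r:r + 1000], window)
                 for r in range(0, lines, 1000)])
print(out.shape)   # (10000, 10000)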
This avoids an intermediate copy, as the reshape keeps dimensions contiguous
dat_in.reshape((lines // window, window, pixels // window, window)).mean(axis=(1, 3))

sort/lexsort/bincount/etc. for arrays that don't fit in memory

I'm trying to scale up a library written in numpy so that it can process arrays that don't fit in memory (~10 arrays of 10 billion elements)
HDF5 (h5py) was a temporary solution, but I rely heavily on sorting and indexing (b = a[a > 5]), both of which are not available in h5py and are a pain to write.
Is there a library that would make these tools available?
Specifically I need basic math, sort, lexsort, argsort, bincount, np.diff, and indexing (both boolean and with the array of indices).
PyTables is designed precisely for this (it is also based on HDF5). First, store your array to disk:
import numpy as np
import tables as tb
# Write big numpy array to disk
rows, cols = 80000000, 2
h5file = tb.open_file('test.h5', mode='w', title="Test Array")
root = h5file.root
array_on_disk = h5file.create_carray(root, 'array_on_disk',
                                     tb.Float64Atom(), shape=(rows, cols))
# Fill part of the array
rand_array = np.random.rand(1000)
array_on_disk[10055:11055] = rand_array
array_on_disk[12020:13020] = 2.*rand_array
h5file.close()
Then perform your computation directly on the array (or part of it) contained in the file
h5file = tb.open_file('test.h5', mode='r')
print(h5file.root.array_on_disk[10050:10065, 0])
# sort a slice of the array (the slice is read into memory, then sorted)
h5file.root.array_on_disk[100000:10000000, :].sort(axis=0)
h5file.close()
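The question also asks about bincount; as one hedged aside (and assuming the stored values are non-negative integers, unlike the random floats in the toy example above), counts can be accumulated chunk by chunk over the on-disk array, since bincounts of separate pieces simply add:
import numpy as np
import tables as tb

h5file = tb.open_file('test.h5', mode='r')
col = h5file.root.array_on_disk
counts = np.zeros(1, dtype=np.int64)
chunk = 10_000_000
for start in range(0, col.nrows, chunk):
    block = col[start:start + chunk, 0].astype(np.int64)  # assumes integer-valued data
    c = np.bincount(block)
    if c.size > counts.size:
        counts = np.pad(counts, (0, c.size - counts.size))
    counts[:c.size] += c
h5file.close()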

Reducing memory usage when indexing in Numpy

I have a large NumPy matrix act with dtype=np.float32, and two vectors of the same length, raw_id and raw_label. I want to sort all three objects based on the values in raw_id. However, I get a memory error when running this script; I've isolated it to act[sortind,:] in the function below. How can I optimize the memory usage?
The array act is roughly 1400000 x 400, whereas raw_id and raw_label are 1400000 x 1 with dtype=np.float64. Together with the remaining variables I have initialised, it only just fits into my 12 GB of memory.
def sort_by_id(act, raw_id, raw_label):
    sortind = np.argsort(raw_id)
    return act[sortind,:], raw_id[sortind], raw_label[sortind]

# calling function with same variables
act, raw_id, raw_label = sort_by_id(act, raw_id, raw_label)
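A hedged sketch of one way to keep the peak memory down: apply the permutation to act in column blocks, so that only one (n_rows x blk) temporary exists at a time. The helper name and the block size blk are hypothetical.
import numpy as np

def sort_by_id_blocked(act, raw_id, raw_label, blk=50):
    sortind = np.argsort(raw_id)
    # Permute the rows of act one column block at a time; the right-hand side
    # is materialised before the assignment, so this is safe and only needs a
    # (n_rows x blk) temporary instead of a full copy of act.
    for start in range(0, act.shape[1], blk):
        act[:, start:start + blk] = act[sortind, start:start + blk]
    return act, raw_id[sortind], raw_label[sortind]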

Improving moving-window computation in memory consumption and speed

Is it possible to obtain better performance (both in memory consumption and speed) in this moving-window computation? I have a 1000x1000 numpy array and I take 16x16 windows across the whole array, finally applying some function to each window (in this case, a discrete cosine transform).
import numpy as np
from scipy.fftpack import dct
from skimage.util import view_as_windows
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
window_size = 16
windows = view_as_windows(X, (window_size,window_size))
dcts = np.zeros(windows.reshape(-1,window_size, window_size).shape, dtype=np.float32)
for idx, window in enumerate(windows.reshape(-1, window_size, window_size)):
    dcts[idx, :, :] = dct(window)
dcts = dcts.reshape(windows.shape)
This code takes too much memory (in the example above the memory consumption is not so bad: windows uses 1 GB and dcts also needs 1 GB) and takes 25 seconds to complete. I'm not sure what I'm doing wrong, because this should be a straightforward calculation (e.g. filtering an image). Is there a better way to accomplish this?
UPDATE:
I was initially worried that the arrays produced by Kington's solution and my initial approach were very different, but the difference is restricted to the boundaries, so it is unlikely to cause serious issues for most applications. The only remaining problem is that both solutions are very slow. Currently, the first solution takes 1min 10s and the second solution 59 seconds.
UPDATE 2:
I noticed the biggest culprits by far are dct and np.mean. Even generic_filter performs decently (8.6 seconds) using a "cythonized" version of mean with bottleneck:
import scipy.ndimage
import bottleneck as bp
from scipy.fftpack import dct

def func(window, shape):
    window = window.reshape(shape)
    #return np.abs(dct(dct(window, axis=1), axis=0)).mean()
    return bp.nanmean(dct(window))

result = scipy.ndimage.generic_filter(X, func, (16, 16),
                                      extra_arguments=([16, 16],))
I'm currently reading how to wrap C code using numpy in order to replace scipy.fftpack.dct. If anyone knows how to do it, I would appreciate the help.
Since scipy.fftpack.dct calculates separate transforms along the last axis of the input array, you can replace your loop with:
windows = view_as_windows(X, (window_size,window_size))
dcts = dct(windows)
result1 = dcts.mean(axis=(2,3))
Now only the dcts array requires a lot of memory, and windows remains merely a view into X. Because the DCTs are calculated with a single function call, it's also much faster. However, because the windows overlap there are a lot of repeated calculations. This can be overcome by calculating the DCT for each sub-row only once, followed by a windowed mean:
ws = window_size
row_dcts = dct(view_as_windows(X, (1, ws)))
cs = row_dcts.squeeze().sum(axis=-1).cumsum(axis=0)
result2 = np.vstack((cs[ws-1], cs[ws:]-cs[:-ws])) / ws**2
Though it seems that what is gained in efficiency is lost in code clarity... Basically, the approach here is to first calculate the DCTs and then take the window average, by summing over the 2D window and dividing by the number of elements in the window. The DCTs are already calculated over row-wise moving windows, so we take a regular sum over those. However, we also need a moving-window sum over the columns to arrive at the proper 2D window sums. To do this efficiently we use a cumsum trick, where:
sum(A[p:q]) # q-p == window_size
Is equivalent to:
cs = cumsum(A)
cs[q-1] - cs[p-1]
This avoids having to sum the exact same numbers over and over. Unfortunately it doesn't work for the first window (when p == 0), so for that case we take cs[q-1] alone and stack it together with the other window sums. Finally, we divide by the number of elements to arrive at the 2D window average (a small check of this trick follows below).
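A small self-contained check of the cumsum trick described above; the array and window size are made up purely for illustration:
import numpy as np

A = np.arange(10.0)
w = 4
cs = A.cumsum()
window_sums = np.concatenate(([cs[w - 1]], cs[w:] - cs[:-w]))
# compare against the direct sliding-window sums
direct = np.array([A[p:p + w].sum() for p in range(A.size - w + 1)])
print(np.allclose(window_sums, direct))   # True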
If you want to do a 2D DCT, this second approach becomes less interesting, because you'll eventually need the full 985 x 985 x 16 x 16 array before you can take the mean.
Both approaches above should be equivalent, but it may be a good idea to perform the arithmetic with 64-bit floats:
np.allclose(result1, result2, atol=1e-6)
# False
np.allclose(result1, result2, atol=1e-5)
# True
skimage.util.view_as_windows uses striding tricks to make an array of overlapping "windows" that doesn't use any additional memory.
However, when you make a new array with that shape, it will require on the order of 16 x 16 times the memory that your original X array (or the windows view) used.
Based on your comment, your end result is doing dcts.reshape(windows.shape).mean(axis=2).mean(axis=2) - taking the mean of the dct of each window.
Therefore, it would be more memory-efficient (though similar performance-wise) to take the mean inside the loop and not store the huge intermediate array of windows:
import numpy as np
from scipy.fftpack import dct
from skimage.util import view_as_windows
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
window_size = 16
windows = view_as_windows(X, (window_size, window_size))
dcts = np.zeros(windows.shape[:2], dtype=np.float32).ravel()
for idx, window in enumerate(windows.reshape(-1, window_size, window_size)):
    dcts[idx] = dct(window).mean()
dcts = dcts.reshape(windows.shape[:2])
Another option is scipy.ndimage.generic_filter. It won't increase performance much (the bottleneck is the python function call in the inner loop), but you'll have a lot more boundary condition options, and it will be fairly memory efficient:
import numpy as np
from scipy.fftpack import dct
import scipy.ndimage
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
def func(window, shape):
    window = window.reshape(shape)
    return dct(window).mean()

result = scipy.ndimage.generic_filter(X, func, (16, 16),
                                      extra_arguments=([16, 16],))
