Applying aggregation function over large numpy array

I am trying to calculate a mean value across large numpy array. Originally, I tried:
data = (np.ones((10**6, 133)) for _ in range(100))
np.stack(data).mean(axis=0)
but I was getting
numpy.core._exceptions.MemoryError: Unable to allocate xxx GiB for an array with shape (100, 1000000, 133) and data type float32
In the original code, data is a generator of more meaningful vectors.
I thought about using dask for this operation, hoping it would split my data into chunks backed by disk.
import dask.array as da
import numpy as np
data = (np.ones((10**6, 133)) for _ in range(100))
x = da.stack(da.from_array(arr, chunks="auto") for arr in data)
x = da.mean(x, axis=0)
y = x.compute()
However, when I run it, the process terminates with "Killed".
How can I resolve this issue on a single machine?

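You can try this approach, which keeps a running sum so that only one array from the generator is in memory at a time:
agg_sum = np.zeros((10**6, 133))
total = 100  # number of arrays the generator yields
for dt in data:
    agg_sum += dt  # in-place add, no new array per iteration
_mean = agg_sum / total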
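
The running-sum idea above can be sketched as a small helper that also counts the arrays, so the total does not have to be known in advance. This is only a sketch: the name `streaming_mean` and the tiny shapes are assumptions for illustration, not part of the original answer.

```python
import numpy as np

def streaming_mean(arrays):
    """Mean over an iterable of equally shaped arrays, holding one at a time."""
    total = None
    count = 0
    for arr in arrays:
        if total is None:
            # Allocate the accumulator once, using the shape of the first array.
            total = np.zeros(arr.shape, dtype=np.float64)
        total += arr  # in-place accumulation
        count += 1
    if count == 0:
        raise ValueError("no arrays to average")
    return total / count

# Small stand-in for the generator in the question (assumed shapes).
data = (np.full((4, 3), float(i)) for i in range(5))
result = streaming_mean(data)
```

Peak memory here is roughly one accumulator plus one input array, regardless of how many arrays the generator yields.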

An alternative solution I found is to store all arrays in a disk-backed file, using numpy.memmap.
import numpy as np
total = 100
shape = (10 ** 6, 133)
c = np.memmap(
"total.array", dtype="float64", mode="w+", shape=(total, *shape), order="C"
)
for idx, arr in enumerate(data):
    c[idx,:,:] = arr[:]
    del arr
c.mean(axis=0)
The important thing here is the del arr: it drops the reference to each array as soon as it has been copied, so memory is not exhausted before the garbage collector reclaims unused arrays.
Note that this solution requires around 100 GB of disk space (100 x 10^6 x 133 float64 values), while the solution of @MSS requires much less space because it keeps only the current sum.
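
A scaled-down sketch of the memmap approach, written so it can be run without 100 GB of disk. The temporary directory, the small shapes, and the stand-in generator are assumptions for illustration only.

```python
import os
import tempfile
import numpy as np

total = 5
shape = (4, 3)
# Stand-in for the real generator (assumed data).
data = (np.full(shape, float(i)) for i in range(total))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "total.array")
    # Disk-backed array; only the pages being touched need to be in RAM.
    c = np.memmap(path, dtype="float64", mode="w+", shape=(total, *shape))
    for idx, arr in enumerate(data):
        c[idx] = arr
    c.flush()  # write pending changes to disk
    mean = np.asarray(c.mean(axis=0))
    del c  # release the mapping before the directory is removed
```

The file size is simply the number of elements times the item size, so the full-size case from the question needs 100 * 10**6 * 133 * 8 bytes.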

Related

Fastest way to store 3D numpy array in a loop

I need to store a numpy array of shape (2000,720,1280) which is created in every loop. My code looks like:
U_list = []
for N_f in range(N):
    U = somefunction(N_f)
    U_list.append(U)
    del U
So I delete the matrix U in every iteration because my RAM gets full.
Is this a good way to store the matrix U, or would you recommend another solution? I compared the code to MATLAB, and MATLAB needs half the time to compute. I think storing U in a list could be the reason.

Preallocating the results array, as below, will tell you right away whether all N of the U arrays fit in memory. If N is so large that the results array cannot be created, you'll have to get creative, for example by saving every 20 results to a pickle file.
import numpy as np
N = 20
shape = (2000, 720, 1280)
#Make sure to match the dtype returned by somefunction
results = np.zeros((N, *shape))
for N_f in range(N):
    results[N_f] = somefunction(N_f)
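
A minimal sketch of the same preallocation that takes the shape and dtype from the first result instead of hard-coding them. The body of `somefunction` here is a hypothetical stand-in with tiny shapes; the real function from the question is not shown in the source.

```python
import numpy as np

def somefunction(n_f):
    # Hypothetical stand-in for the real computation in the question.
    return np.full((4, 3, 2), n_f, dtype=np.float32)

N = 6
first = somefunction(0)
# Match shape and dtype to what somefunction actually returns.
results = np.empty((N, *first.shape), dtype=first.dtype)
results[0] = first
for n_f in range(1, N):
    results[n_f] = somefunction(n_f)

# Memory needed for all results, known before the loop starts.
required_bytes = N * first.nbytes
```

For scale, a single (2000, 720, 1280) array of float64 is 2000 * 720 * 1280 * 8 bytes, about 14.7 GB, so the dtype choice matters a great deal here.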

Dask chunk masking

I have an application where I am loading raster data into a Dask array and then I only need to process the chunks which overlap with some region of interest. I know that I can create a Dask masked array, but I am looking for a way to prevent certain chunks from being processed at all – as some of the ROIs contain multiple polygons which are very far apart and thus 90% of the chunks will be discarded in the end.
A simple example would be, as below, where arr2 contains no information at all, but is needed for alignment of the other chunks.
import numpy as np
import dask.array as da
arr0 = da.from_array(np.arange(1, 26).reshape(5,5), chunks=(5, 5))
arr1 = da.from_array(np.arange(25, 50).reshape(5,5), chunks=(5, 5))
arr2 = da.from_array(np.zeros((5,5)), chunks=(5, 5))
arr3 = da.from_array(np.arange(75, 100).reshape(5,5), chunks=(5, 5))
a = da.block([[arr0, arr1],[arr2, arr3]])
b = da.ma.masked_equal(a, 0)
c = da.min(b)
c.visualize()
By plotting the graph, we can see that arr2 is still in the computational graph. Furthermore, it takes up memory, as it will still be evaluated even though it is masked. What I'd like to achieve is a way to mask the entire chunk/block so that it is ignored in the computation altogether.

Printing and reading Numpy arrays efficiently

I would like to print a Numpy array and then read it back. This is what I have done so far:
#printer
import numpy as np
N = 100
x = np.arange(N)
for xi in x:
    print(xi)
#reader
import numpy as np
N = 100
x = np.empty(N)
for i in range(N):
    x[i] = float(input())
This gets the job done, but I think it may not be the most efficient way because of the repeated calls to input(). An alternative I considered is printing only once, reading only once, and then parsing what I read. This approach has some similarities with this question. In contrast to that question, I have some extra information that could possibly be used to improve performance:
N is known in advance (to both programs)
Arrays are only 1D or 2D (of sizes N and NxN respectively)
Data are float
Data are fully trusted
Thanks in advance.
Edit: I have to add that the value of N will not be that large, even N=1000 will be huge for my problem.
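
One possible way to write and read the whole array in a single call is np.savetxt and np.loadtxt. The sketch below uses an in-memory io.StringIO buffer as a stand-in for the pipe between the two programs; in the real setup the printer would pass sys.stdout and the reader sys.stdin. The helper names `write_array` and `read_array` are assumptions for illustration.

```python
import io
import numpy as np

N = 4
x1 = np.arange(N, dtype=float)                    # 1D case, size N
x2 = np.arange(N * N, dtype=float).reshape(N, N)  # 2D case, size NxN

def write_array(arr, stream):
    # One call writes every row as text; no Python-level loop over elements.
    np.savetxt(stream, arr)

def read_array(stream):
    # One call parses the whole text back into a float array.
    return np.loadtxt(stream)

buf1 = io.StringIO()
write_array(x1, buf1)
buf1.seek(0)
y1 = read_array(buf1)

buf2 = io.StringIO()
write_array(x2, buf2)
buf2.seek(0)
y2 = read_array(buf2)
```

Since N is known to both programs, the reader could also reshape or validate the result against the expected size.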

My data input is 5 GB of numpy arrays, yet running through a function takes 20 GB. Why?

The code is too complicated to paste here, but I have a numpy array shaped (800, 800, 1300), or 1300 matrices shaped (800, 800). This is 5GB.
I pass this array into a function, which
1. multiplies each "matrix" in the above array by a float in a (1300,) shaped array,
2. sums the array into one "matrix", shaped (800, 800),
3. and takes the inverse of the matrix.
This program runs at 20.2 GB RAM! Is that possible? I cannot see any memory leaks. I am simply taking numpy arrays, and passing them through a function. I then save the resulting arrays.
I'll try to post the code.
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.io
import os
data_file1 = "filename1.npy"
data_file2 = "filename2.npy"
data_file3 = "filename3.npy"
data1 = np.load(data_file1)
data2 = np.load(data_file2)
data3 = np.load(data_file3)
data_total = np.concatenate((data1, data2, data3)) # This array is shape (800,800,1300), around 6 GB.
array1 = np.arange(1300) + 1
vector = np.arange(800) + 1
def function_matrix(data_total, vector):
    Multi_matrix = array1[:, None, None] * data_total # step 1, multiplies each (800,800) matrix
    Sum_matrix = np.sum(Multi_matrix, axis=0) #sum matrix
    mTCm = np.array([np.dot(vector.T , (np.linalg.solve(Sum_matrix , vector)) )])
    return mTCm
draw_pointsA = np.asarray([[function_matrix(data_total[i], vector[j]) for i in np.arange(0,100)] for j in np.arange(0,100)])
filename = "save_datapoints.npy"
np.save(filename, draw_pointsA)
EDIT 2:
See below. It is actually 12 GB RAM, 20.1 GB virtual size of process.

This doesn't answer your question, but proposes a way to avoid the problem from the start.
Step 1 is sequential: you only need one matrix loaded at a time.
Change your code to process each matrix independently.
By Step 2 your memory requirement is down to 800 * 800 * sizeof(datum), which is a few megabytes, and you can certainly afford to keep that in memory.
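
A minimal sketch of processing one matrix at a time, compared against the broadcast version from the question. The small shapes and random data are assumptions, and the stack is assumed to have the matrix index on the leading axis, as the `array1[:, None, None]` broadcast in the question implies.

```python
import numpy as np

rng = np.random.default_rng(0)
data_total = rng.random((13, 8, 8))  # stand-in for the (1300, 800, 800) stack
weights = np.arange(13) + 1.0        # stand-in for array1

# Accumulate the weighted sum one matrix at a time: the only extra
# memory is one (8, 8) accumulator plus one temporary product.
sum_matrix = np.zeros(data_total.shape[1:])
for w, matrix in zip(weights, data_total):
    sum_matrix += w * matrix

# The question's version builds a full-size temporary before summing.
broadcast_sum = np.sum(weights[:, None, None] * data_total, axis=0)

# np.tensordot contracts over the leading axis without that temporary.
tensordot_sum = np.tensordot(weights, data_total, axes=(0, 0))
```

The loop form also works when the matrices are loaded from disk one by one, since it never needs the whole stack at once.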

It sounds like this could be a type issue, i.e. the values in the matrices were converted to a different type. Perhaps you stored the original matrix as int16 or single-precision floats (float32), and after multiplying it by a float it is stored as a matrix of doubles (float64), which needs twice the memory of float32 and four times that of int16.
You can use the dtype argument to set the value type for the matrix.
Another possible reason is that some additional matrices are created along the way. That's impossible to determine unless you post the code.
A possible solution to your memory problem is to use HDF5 files and write the matrices to disk. Then you could load the matrices one at a time. This is easy with h5py, as the matrices can be compressed and/or sliced using numpy/scipy syntax.
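
The dtype effect described above can be checked directly. This sketch assumes small stand-in arrays and explicit float64 weights; whether the real data is float32 or int16 is not stated in the question.

```python
import numpy as np

data_f32 = np.ones((4, 8, 8), dtype=np.float32)
data_i16 = np.ones((4, 8, 8), dtype=np.int16)
weights = np.arange(1, 5, dtype=np.float64)

# Multiplying by a float64 array promotes the result to float64.
prod_f32 = weights[:, None, None] * data_f32
prod_i16 = weights[:, None, None] * data_i16

# Casting the weights first keeps the product in float32.
kept_f32 = weights.astype(np.float32)[:, None, None] * data_f32

growth_from_f32 = prod_f32.nbytes // data_f32.nbytes
growth_from_i16 = prod_i16.nbytes // data_i16.nbytes
```

Inspecting `.dtype` and `.nbytes` on the intermediate arrays is a quick way to see where the extra memory goes.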

Memory Error when using float32 in dask array

I am trying to import a 1.25 GB dataset into python using dask.array
The file is a 1312*2500*196 Array of uint16's. I need to convert this to a float32 array for later processing.
I have managed to stitch together this Dask array in uint16, however when I try to convert to float32 I get a memory error.
It doesn't matter what I do to the chunk size, I will always get a memory error.
I create the array by concatenating it in blocks of 100 lines (breaking the 2500 dimension up into pieces of 100 lines). Since dask can't natively read .RAW imaging files, I have to use numpy.memmap() to read the file and then create the array.
Below I will supply a "as short as possible" code snippet:
I have tried two methods:
1) Create the full uint16 array and then try to convert to float32:
(note: the memmap is a 1312x100x196 array and lines ranges from 0 to 24)
for i in range(lines):
    NewArray = da.concatenate([OldArray,Memmap],axis=0)
    OldArray = NewArray
return NewArray
and then I use
Float32Array = FinalArray.map_blocks(lambda FinalArray: FinalArray * 1.,dtype=np.float32)
In method 2:
for i in range(lines):
    NewArray = da.concatenate([OldArray,np.float32(Memmap)],axis=0)
    OldArray = NewArray
return NewArray
Both methods result in a memory error.
Is there any reason for this?
I read that dask array is capable of doing up to 100 GB dataset calculations.
I tried all chunk sizes (from as small as 10x10x10 to a single line)

You can create a dask.array from a numpy memmap array directly with the da.from_array function:
x = load_memmap_numpy_array_from_raw_file(filename)
d = da.from_array(x, chunks=...)
You can change the dtype with the astype method:
d = d.astype(np.float32)
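
The helper name in the answer is a placeholder. One possible implementation, sketched with np.memmap, is shown below; the extra `shape` and `dtype` parameters, the headerless C-ordered file layout, and the tiny demo file are all assumptions, since the actual .RAW format is not described in the question.

```python
import os
import tempfile
import numpy as np

def load_memmap_numpy_array_from_raw_file(filename, shape, dtype=np.uint16):
    """Map a headerless raw file into memory without reading it all."""
    return np.memmap(filename, dtype=dtype, mode="r", shape=shape)

shape = (4, 5, 3)  # stand-in for (1312, 2500, 196)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image.raw")
    # Write a small raw uint16 file to map (demo data only).
    np.arange(60, dtype=np.uint16).reshape(shape).tofile(path)

    x = load_memmap_numpy_array_from_raw_file(path, shape)
    # Only the slice being converted is read and held as float32.
    first_block = x[:2].astype(np.float32)
    del x  # release the mapping before the directory is removed
```

The memmap returned by this helper is what would be handed to da.from_array, so that dask reads and converts one chunk at a time rather than the whole file.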
