I have a large .tiff file (4.4 GB, 79530 x 54980 values) with 1 band. Since only 16% of the values are valid, I thought it would be better to import the file as a sparse matrix to save RAM. When I first open it as a np.array and then convert it into a sparse matrix with csr_matrix(), my kernel crashes. See the code below.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
array = np.array(band.ReadAsArray())
csr_matrix(array)
Is there a better way to work with this file? In the end I have to make calculations based on the values in the raster. (Unfortunately, due to confidentiality, I cannot attach the relevant file.)
Can you tell where the crash occurs?
band = ds.GetRasterBand(1)
temp = band.ReadAsArray()
array = np.array(temp) # if temp is already an array, you don't need this
csr_matrix(array)
If array is 4.4 GB with shape (79530, 54980):
In [62]: (79530 * 54980) / 1e9
Out[62]: 4.3725594 # 4.4 GB makes sense for 1 byte/element
In [63]: (79530 * 54980) * 0.16 # 16% density
Out[63]: 699609504.0 # number of nonzero values
Creating a csr matrix requires calling np.nonzero(array) to get the indices. That produces 2 arrays of 0.7e9 * 8 bytes each (indices are 8-byte ints), about 5.6 GB apiece. The coo format needs those 2 index arrays plus roughly 0.7 GB for the nonzero values themselves - about 12 GB in total. Converted to csr, the row attribute is reduced to 79530 elements, so about 7 GB (corrected for 8 bytes/element).
So at 16% density the sparse format, even at its best, is still larger than the dense version.
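A rough back-of-the-envelope check of those numbers (assuming 1-byte data values, e.g. uint8, and 8-byte indices, as above):
nrows, ncols = 79530, 54980
density = 0.16
nnz = int(nrows * ncols * density)           # ~0.7e9 nonzero values
dense_bytes = nrows * ncols * 1              # 1 byte per element
coo_bytes = nnz * (8 + 8 + 1)                # row index + column index + value, per nonzero
csr_bytes = nnz * (8 + 1) + (nrows + 1) * 8  # column index + value, plus the row pointer
print(dense_bytes / 1e9, coo_bytes / 1e9, csr_bytes / 1e9)  # ~4.4, ~11.9, ~6.3 GB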
Memory error when converting matrix to sparse matrix, specified dtype is invalid
is a recent case of a memory error, which occurred in the nonzero step.
Assuming you know the size of your matrix, you can create an empty sparse matrix and then set only the valid values one by one.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
matrix_size = (1000, 1000) # set your size
matrix = csr_matrix(matrix_size) # creates an empty sparse matrix of that shape
# then, for each valid value (with row index i and column index j):
matrix[i, j] = your_value # note: per-item assignment on csr_matrix is slow, see Edit 2
Edit 1
If you don't know the size of your matrix, you should be able to check it like this:
from osgeo import gdal
ds = gdal.Open("file.tif")
width = ds.GetRasterXSize()
height = ds.GetRasterYSize()
matrix_size = (height, width) # note: ReadAsArray returns an array with rows = height and columns = width
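Putting both parts together, a minimal sketch of how the file could be read strip by strip and only the valid values stored (this assumes "valid" means "different from the band's nodata value" and that a nodata value is set -- both assumptions about your data; the strip size is arbitrary):
from osgeo import gdal
import numpy as np
from scipy.sparse import lil_matrix

ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
width = ds.GetRasterXSize()
height = ds.GetRasterYSize()
nodata = band.GetNoDataValue()  # assumption: invalid cells are marked with a nodata value

matrix = lil_matrix((height, width))  # rows = height, columns = width
strip = 1024                          # number of raster rows read per chunk
for yoff in range(0, height, strip):
    nrows = min(strip, height - yoff)
    chunk = band.ReadAsArray(0, yoff, width, nrows)
    r, c = np.nonzero(chunk != nodata)  # positions of the valid values in this strip
    matrix[r + yoff, c] = chunk[r, c]
This way the full dense array never has to be held in memory at once (though, as noted in the other answer, at 16% density the sparse result itself is not necessarily smaller than the dense array).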
Edit 2
I measured the matrix types suggested in the comments, filling them completely, and recorded memory usage and filling time (a sketch of the measurement approach is shown after the tables).
size 500x500

matrix       empty size (bytes)   full size (bytes)   filling time
csr_matrix   2856                 2992                477.67 s
dok_matrix   726                  35807578            3.15 s
lil_matrix   8840                 8840                0.54 s
size 1000x1000

matrix       empty size (bytes)   full size (bytes)   filling time
csr_matrix   4856                 4992                7164.94 s
dok_matrix   726                  150578858           12.81 s
lil_matrix   16840                16840               2.19 s
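The exact measurement code is not shown here; below is a minimal sketch of how such a benchmark could be reproduced (using sys.getsizeof for the object size and time.perf_counter for the filling time is an assumption about how the numbers above were obtained, and getsizeof is shallow, so it undercounts memory held by referenced arrays):
import time
from sys import getsizeof
from scipy.sparse import csr_matrix, dok_matrix, lil_matrix

n = 500
for cls in (csr_matrix, dok_matrix, lil_matrix):
    m = cls((n, n))
    empty_size = getsizeof(m)   # shallow size; a recursive measure would differ
    start = time.perf_counter()
    for i in range(n):
        for j in range(n):
            m[i, j] = 1.0       # fill every element, one assignment at a time
    elapsed = time.perf_counter() - start
    print(f"{cls.__name__}: empty {empty_size}, full {getsizeof(m)}, filled in {elapsed:.2f} s")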
Probably the best solution would be to use lil_matrix.
Related
I want to initialize a 300,000 x 300,000 sparse matrix using scipy, but it requires memory as if it were not sparse:
>>> from scipy import sparse
>>> sparse.rand(300000,300000,.1)
it gives the error:
MemoryError: Unable to allocate 671. GiB for an array with shape (300000, 300000) and data type float64
which is the same error as if I initialize using numpy:
np.random.normal(size=[300000, 300000])
Even when I go to a very low density, it reproduces the error:
>>> from scipy import sparse
>>> sparse.rand(300000,300000,.000000000001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../python3.8/site-packages/scipy/sparse/construct.py", line 842, in rand
return random(m, n, density, format, dtype, random_state)
File ".../lib/python3.8/site-packages/scipy/sparse/construct.py", line 788, in random
ind = random_state.choice(mn, size=k, replace=False)
File "mtrand.pyx", line 980, in numpy.random.mtrand.RandomState.choice
File "mtrand.pyx", line 4528, in numpy.random.mtrand.RandomState.permutation
MemoryError: Unable to allocate 671. GiB for an array with shape (90000000000,) and data type int64
Is there a more memory-efficient way to create such a sparse matrix?
Just generate only what you need.
from scipy import sparse
import numpy as np
n, m = 300000, 300000
density = 0.00000001
size = int(n * m * density)
rows = np.random.randint(0, n, size=size)
cols = np.random.randint(0, m, size=size)
data = np.random.rand(size)
arr = sparse.csr_matrix((data, (rows, cols)), shape=(n, m))
This lets you build monster sparse arrays provided they're sparse enough to fit into memory.
>>> arr
<300000x300000 sparse matrix of type '<class 'numpy.float64'>'
with 900 stored elements in Compressed Sparse Row format>
This is probably how the sparse.rand constructor should be working anyway. If any (row, col) pairs collide, the data values are added together, which is probably fine for all applications I can think of.
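For example, a quick toy check (with made-up indices) that colliding (row, col) pairs are summed rather than overwritten:
from scipy import sparse

rows = [0, 0, 1]
cols = [2, 2, 3]          # the pair (0, 2) appears twice
data = [1.0, 2.0, 5.0]
m = sparse.csr_matrix((data, (rows, cols)), shape=(4, 4))
print(m[0, 2])            # 3.0 -- the two entries for (0, 2) were added together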
Try passing a reasonable density argument as seen in the docs... if you have around 90 billion cells, maybe something like 0.00000001...
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.rand.html#scipy.sparse.rand
hpaulj's comment is spot on. There is also a clue in the error message.
MemoryError: Unable to allocate 671. GiB for an array with shape (90000000000,) and data type int64
There is a reference to int64, not float64, and to a linear array of 90,000,000,000 elements (300,000 x 300,000). This points to an intermediate random-sampling step in the creation of the sparse matrix, which itself occupies a lot of memory.
Note that when creating any sparse matrix (irrespective of the format), you have to account for memory both for the nonzero values and for representing their positions in the matrix.
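To put rough numbers on that for the matrix in the question (assuming COO-style storage with an 8-byte value plus 8-byte row and column indices per entry; scipy can use 4-byte indices where they fit, which roughly halves the index cost):
n = m = 300000
density = 0.1
nnz = int(n * m * density)                # 9 billion nonzero entries
bytes_per_entry = 8 + 8 + 8               # value + row index + column index
print(nnz * bytes_per_entry / 1e9, "GB")  # 216.0 GB -- still far too large at density 0.1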
I have a .npz file which I want to load into RAM. The compressed file size is 30 MB. I am doing the following operation to load the data into RAM.
import numpy as np
from scipy import sparse
from sys import getsizeof
a = sparse.load_npz('compressed/CRS.npz').todense()
getsizeof(a)
# 136
type(a)
# numpy.matrixlib.defmatrix.matrix
b = np.array(a)
getsizeof(b)
# 64000112
type(b)
# numpy.ndarray
Why does the numpy.matrix object occupy so little memory compared to the numpy.ndarray? Both a and b have the same dimensions and data.
Your a matrix is a view of another array, so the underlying data is not counted towards its getsizeof. You can see this by checking that a.base is not None, or by seeing that the OWNDATA flag is False in a.flags.
Your b array is not a view, so the underlying data is counted towards its getsizeof.
numpy.matrix doesn't provide any memory savings.
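A quick way to see the difference (a small, self-contained example here; the flags and nbytes checks are the relevant part -- nbytes reports the size of the underlying data buffer, which is the same for both):
import numpy as np
from sys import getsizeof

a = np.asmatrix(np.ones((1000, 1000)))  # a view: the matrix does not own its data
b = np.array(a)                         # a copy: owns its own 8 MB buffer

print(a.base is not None, a.flags['OWNDATA'])  # True False -> data not counted by getsizeof
print(b.base is None, b.flags['OWNDATA'])      # True True
print(getsizeof(a), getsizeof(b))              # small constant vs ~8 MB
print(a.nbytes, b.nbytes)                      # 8000000 for both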
I'm trying to solve a Markov chain problem in which the transition matrix contains about 150,000 rows and columns, which is, however, sparse (only about 450,000 elements are nonzero).
I notice that trying to construct a csr_matrix matrix from a np.zeros array of that size leads to a Killed: 9 error:
In [139]: N = 150000
In [140]: T = np.zeros((N, N))
In [142]: import scipy.sparse
In [143]: _T = scipy.sparse.csr_matrix(T)
Killed: 9
Is it possible to construct a csr_matrix of this size? Do I need to initiate the matrix T as a csr_matrix and dispense with NumPy arrays altogether?
Your process is "Killed: 9" because it is taking up too much of the system's memory and has been terminated by the OS. Just like in the comment, you can construct the sparse matrix directly with csr_matrix:
_T = scipy.sparse.csr_matrix((N,N))
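Since only about 450,000 entries are nonzero, another option is to build the matrix directly from those entries and never create a dense array at all. A minimal sketch, assuming the transitions are available as three arrays rows, cols and probs (hypothetical names for however your data is stored):
import numpy as np
import scipy.sparse

N = 150000
# hypothetical data: rows[k], cols[k], probs[k] describe one nonzero transition
rows = np.array([0, 0, 1])
cols = np.array([1, 2, 0])
probs = np.array([0.5, 0.5, 1.0])

T = scipy.sparse.csr_matrix((probs, (rows, cols)), shape=(N, N))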
I am working with binary (only 0's and 1's) matrices with rows and columns in the order of a few thousand. For example, the number of rows is between 2000 and 7000 and the number of columns is between 4000 and 15000. My computer has more than 100 GB of RAM.
I'm surprised that even with these sizes I am getting a MemoryError with the following code. For reproducibility, I'm including an example with a smaller matrix (10 x 20). Note that both of the following raise this error:
import numpy as np
my_matrix = np.random.randint(2,size=(10,20))
tr, tc = np.triu_indices(my_matrix.shape[0],1)
ut_sums = np.sum(my_matrix[tr] * my_matrix[tc], 1)
denominator = 100
value = 1 - ut_sums.astype(float)/denominator
np.einsum('i->', value)
I tried to replace the elementwise multiplication in the above code with einsum as below, but it also generates the same MemoryError:
import numpy as np
my_matrix = np.random.randint(2,size=(10,20))
tr, tc = np.triu_indices(my_matrix.shape[0],1)
ut_sums = np.einsum('ij,ij->i', my_matrix[tr], my_matrix[tc])
denominator = 100
value = 1 - ut_sums.astype(float)/denominator
np.einsum('i->', value)
In both cases, the printed Traceback points to the line where ut_sums is being calculated.
Please note that my code has other operations too, and there are other statistics calculated on matrices of similar sizes, but with more than 100 GB of RAM I thought it should not be a problem.
Just because your computer has 100 GB of physical memory does not mean that your operating system is willing or able to allocate such large amounts of contiguous memory. And it does have to be contiguous, because that's how NumPy arrays usually are.
You should figure out how large your output matrix is meant to be, and then try creating a similar one by itself:
arr = np.zeros((10000, 10000))
See if you're able to allocate a single array as large as you want.
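For reference, the array that fails here is the intermediate my_matrix[tr], not the final result; a rough size estimate at the upper end of the stated dimensions (7000 rows, 15000 columns, 8-byte integers -- all assumptions based on the numbers in the question):
n_rows, n_cols = 7000, 15000
n_pairs = n_rows * (n_rows - 1) // 2        # upper-triangle row pairs produced by triu_indices
bytes_needed = n_pairs * n_cols * 8         # my_matrix[tr] has shape (n_pairs, n_cols), int64
print(n_pairs, bytes_needed / 1e12, "TB")   # ~24.5 million pairs, roughly 2.9 TB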
The code is too complicated to paste here, but I have a numpy array shaped (800, 800, 1300), or 1300 matrices shaped (800, 800). This is 5GB.
I pass this array into a function, whereby the function:
1. multiplies each "matrix" in the above array by a float in a (1300,)-shaped array,
2. sums the array into one "matrix", shaped (800, 800),
3. and takes the inverse of the matrix.
This program runs at 20.2 GB RAM! Is that possible? I cannot see any memory leaks. I am simply taking numpy arrays, and passing them through a function. I then save the resulting arrays.
I'll try to post the code.
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.io
import os
data_file1 = "filename1.npy"
data_file2 = "filename2.npy"
data_file3 = "filename3.npy"
data1 = np.load(data_file1)
data2 = np.load(data_file2)
data3 = np.load(data_file3)
data_total = np.concatenate((data1, data2, data3)) # This array is shape (800,800,1300), around 6 GB.
array1 = np.arange(1300) + 1
vector = np.arange(800) + 1
def function_matrix(data_total, vector):
    Multi_matrix = array1[:, None, None] * data_total # step 1, multiplies each (800,800) matrix
    Sum_matrix = np.sum(Multi_matrix, axis=0) # step 2, sum over the 1300 matrices
    mTCm = np.array([np.dot(vector.T, np.linalg.solve(Sum_matrix, vector))])
    return mTCm
draw_pointsA = np.asarray([[function_matrix(data_total[i], vector[j]) for i in np.arange(0,100)] for j in np.arange(0,100)])
filename = "save_datapoints.npy"
np.save(filename, draw_pointsA)
EDIT 2:
It is actually 12 GB of RAM and a 20.1 GB virtual process size.
This doesn't answer your question, but proposes a way to avoid the problem from the start.
Step 1 is sequential -- you only need one matrix loaded at a time.
Change your code to process each matrix independently, as sketched below.
By Step 2 your memory requirement is down to 800 * 800 * sizeof(datum), which is a few megabytes, and you can certainly afford to keep that in memory.
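A minimal sketch of what that could look like, assuming data_total is arranged so that data_total[k] is the k-th (800, 800) matrix, i.e. shape (1300, 800, 800) (the question states (800, 800, 1300), so a transpose or a differently shaped load may be needed):
import numpy as np

def summed_matrix(data_total, weights):
    # Accumulate the weighted sum one (800, 800) matrix at a time,
    # so only one slice plus the running total is in memory.
    acc = np.zeros(data_total.shape[1:])
    for k in range(data_total.shape[0]):
        acc += weights[k] * data_total[k]
    return acc

# Opening each .npy file with np.load(..., mmap_mode="r") and looping over it
# (instead of concatenating everything up front) keeps only one slice in RAM.
The result can then be solved against vector with np.linalg.solve as before.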
It sounds like this could be a type issue, i.e. the values in the matrices were converted to a different type. Perhaps you stored the original matrix with values as int16 or single precision, and after multiplying by a float it is stored as a matrix of double values (which take two to four times more space in memory).
You can use the dtype argument to set the value type for the matrix.
Other possible reasons could be that some additional matrices are created along the way. That's obviously impossible to diagnose unless you post the code.
A possible solution to your memory problem is to use HDF5 files, and write the matrices to disk. Then you could load the matrix one at a time. This is easy with h5py, as the matrices can be compressed, and/or sliced using numpy/scipy syntax.
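A minimal sketch of that approach with h5py (the file name, dataset name, chunk layout and placeholder data are arbitrary choices here):
import h5py
import numpy as np

# Write the stack to disk once, chunked so each (800, 800) matrix is its own chunk.
with h5py.File("matrices.h5", "w") as f:
    dset = f.create_dataset("stack", shape=(1300, 800, 800), dtype="float64",
                            chunks=(1, 800, 800), compression="gzip")
    for k in range(1300):
        dset[k] = np.random.rand(800, 800)  # placeholder data

# Later: read and process one matrix at a time instead of the whole multi-GB array.
with h5py.File("matrices.h5", "r") as f:
    acc = np.zeros((800, 800))
    for k in range(f["stack"].shape[0]):
        acc += f["stack"][k]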