I have a .npz file which I want to load into RAM. The compressed file size is 30 MB. I am doing the following operations to load the data into RAM.
import numpy as np
from scipy import sparse
from sys import getsizeof
a = sparse.load_npz('compressed/CRS.npz').todense()
getsizeof(a)
# 136
type(a)
# numpy.matrixlib.defmatrix.matrix
b = np.array(a)
getsizeof(b)
# 64000112
type(b)
# numpy.ndarray
Why does the numpy.matrix object occupy so little memory compared to the numpy.ndarray? Both a and b have the same dimensions and data.
Your a matrix is a view of another array, so the underlying data is not counted towards its getsizeof. You can see this by checking that a.base is not None, or by seeing that the OWNDATA flag is False in a.flags.
Your b array is not a view, so the underlying data is counted towards its getsizeof.
numpy.matrix doesn't provide any memory savings.
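For illustration, here is a small self-contained sketch (using a plain zeros array instead of your file) that shows the same effect; nbytes reports the size of the underlying buffer regardless of who owns it:
import numpy as np
from sys import getsizeof
arr = np.zeros((1000, 1000))          # owns its ~8 MB buffer
m = np.asmatrix(arr)                  # a matrix view of the same buffer
m.base is arr                         # True  -> m is a view
m.flags['OWNDATA']                    # False -> getsizeof(m) skips the buffer
getsizeof(m), getsizeof(arr)          # tiny number vs. ~8 MB
m.nbytes, arr.nbytes                  # both report the full 8,000,000-byte buffer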
I have a large .tiff file (4.4 GB, 79530 x 54980 values) with 1 band. Since only 16% of the values are valid, I was thinking it would be better to import the file as a sparse matrix to save RAM. When I first open it as an np.array and then transform it into a sparse matrix using csr_matrix(), my kernel crashes. See the code below.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
array = np.array(band.ReadAsArray())
csr_matrix(array)
Is there a better way to work with this file? In the end I have to make calculations based on the values in the raster. (Unfortunately, due to confidentiality, I cannot attach the relevant file.)
Can you tell where the crash occurs?
band = ds.GetRasterBand(1)
temp = band.ReadAsArray()
array = np.array(temp) # if temp is already an array, you don't need this
csr_matrix(array)
If array is 4.4 GB, with shape (79530, 54980):
In [62]: (79530 * 54980) / 1e9
Out[62]: 4.3725594 # 4.4gB makes sense for 1 byte/element
In [63]: (79530 * 54980) * 0.16 # 16% density
Out[63]: 699609504.0 # number of nonzero values
Creating a csr matrix requires doing np.nonzero(array) to get the indices. That will produce 2 arrays of 0.7e9 * 8 bytes each (indices are 8-byte ints). The coo format actually requires those 2 index arrays plus 0.7 GB for the nonzero values themselves - about 12 GB in total. Converted to csr, the row attribute is reduced to 79530 elements - so about 7 GB (corrected for 8 bytes/element).
So at 16% density, the sparse format is, at its best, still larger than the dense version.
Memory error when converting matrix to sparse matrix, specified dtype is invalid is a recent case of a memory error, which occurred in the nonzero step.
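If you want to check these numbers yourself, the index and data arrays of the sparse formats are plain attributes, so you can sum their nbytes. A small sketch with a toy matrix at roughly 16% density (the sizes scale linearly with the number of nonzeros):
import numpy as np
from scipy import sparse
# toy stand-in for the raster: ~16% nonzero, 1 byte per element
dense = (np.random.rand(1000, 1000) < 0.16).astype(np.uint8)
coo = sparse.coo_matrix(dense)
csr = coo.tocsr()
# coo keeps row, col and data arrays, each nnz elements long
coo_bytes = coo.row.nbytes + coo.col.nbytes + coo.data.nbytes
# csr replaces the per-element row array with an (nrows + 1) indptr array
csr_bytes = csr.indptr.nbytes + csr.indices.nbytes + csr.data.nbytes
print(dense.nbytes, coo_bytes, csr_bytes)
# the index dtype (int32 here vs int64 for the full-size raster) sets the scale factor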
Assuming you know the size of your matrix, you can create an empty sparse matrix and then set only the valid values one by one.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
matrix_size = (1000, 1000) # set your size
matrix = csr_matrix(matrix_size)
# for each valid value
matrix[i, j] = your_value
Edit 1
If you don't know the size of your matrix, you should be able to check it like this:
from osgeo import gdal
ds = gdal.Open("file.tif")
width = ds.GetRasterXSize()
height = ds.GetRasterYSize()
matrix_size = (height, width)  # ReadAsArray returns rows x columns, i.e. (height, width)
Edit 2
I measured the matrices suggested in the comments (filled completely). This is how I measured memory usage.
size 500x500

matrix        empty size    full size    filling time
csr_matrix    2856          2992         477.67 s
dok_matrix    726           35807578     3.15 s
lil_matrix    8840          8840         0.54 s

size 1000x1000

matrix        empty size    full size    filling time
csr_matrix    4856          4992         7164.94 s
dok_matrix    726           150578858    12.81 s
lil_matrix    16840         16840        2.19 s
Based on these measurements, probably the best solution is to use lil_matrix.
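A minimal sketch of the lil_matrix route, reading the band in row blocks so the full dense array never exists at once (the block size and the assumption that invalid cells are stored as 0 are mine):
from osgeo import gdal
import numpy as np
from scipy.sparse import lil_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
height, width = ds.GetRasterYSize(), ds.GetRasterXSize()
matrix = lil_matrix((height, width), dtype=np.float32)
block = 1000  # number of raster rows to read per chunk
for row in range(0, height, block):
    nrows = min(block, height - row)
    chunk = band.ReadAsArray(0, row, width, nrows)   # (xoff, yoff, xsize, ysize)
    r, c = np.nonzero(chunk)                         # keep only the valid cells, assuming invalid cells are 0
    matrix[row + r, c] = chunk[r, c]
matrix = matrix.tocsr()  # convert once filling is done; csr is better for arithmetic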
I have a precomputed numpy array that takes up just under 9.5 GB. I have saved it both as an npy file and, using h5py, as an hdf5 file. I can read this array in with either format when working interactively with an interpreter; however, when I read it in while actually running a module, I get a MemoryError:
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 440, in __getitem__
arr = numpy.ndarray(mshape, new_dtype, order='C')
MemoryError
This happens whether I save/read an npy file or an hdf5 file.
I have tried using numpy.memmap so I could substitute disk memory for RAM, but I do not seem to be able to read the array accurately:
>>> import numpy as np
>>> zz=np.load('VGG16_l19_val.npy')
>>> zz.dtype
dtype('float64')
>>> zz.shape
(50000, 25088)
# So, I've read in the array using np.load and know its dtype and shape
>>> from numpy import unravel_index
>>> unravel_index(zz.argmax(), zz.shape)
(41232, 8208)
>>> zz[41232,8208]
937.5606689453125
# I now know the max value of zz and where it occurs
>>> zz2=np.memmap('VGG16_l19_val.npy', mode = 'r', dtype=np.float64, shape= (50000,25088))
>>> zz2.dtype
dtype('float64')
>>> zz2.shape
(50000, 25088)
# I've read a memmap version of the array and have the correct dtype and shape, but ...
>>> zz2[41232,8208]
0.0
>>> zz2.max()
memmap(8.447400968892931e+252)
>>>
# It doesn't appear that zz2 == zz
What don't I understand about np.memmap? Can I use it to read in this numpy array?
If not, what should I do, other than break up the array and save it in several files?
Why can I read the array without a problem when I'm in the interpreter, or in pdb, but can't read it without a MemoryError when I read it within a module?
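For what it's worth, np.memmap maps raw bytes and knows nothing about the .npy header, so mapping the file at offset 0 reads the header bytes as data, which is why zz2 does not match zz. np.load can do the memory mapping itself and handles the header; a minimal sketch:
import numpy as np
zz2 = np.load('VGG16_l19_val.npy', mmap_mode='r')  # memory-mapped; only touched pages hit RAM
zz2.dtype, zz2.shape                               # float64, (50000, 25088) as before
zz2[41232, 8208]                                   # should now match the value read with np.load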
The code is too complicated to paste here, but I have a numpy array shaped (800, 800, 1300), i.e. 1300 matrices shaped (800, 800). This is 5 GB.
I pass this array into a function, whereby the function
multiplies each "matrix" in the above array by a float in a (1300,) shaped array
sums the array into one "matrix", shaped (800, 800)
and takes the inverse of the matrix
This program runs at 20.2 GB RAM! Is that possible? I cannot see any memory leaks. I am simply taking numpy arrays, and passing them through a function. I then save the resulting arrays.
I'll try to post the code.
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.io
import os
data_file1 = "filename1.npy"
data_file2 = "filename2.npy"
data_file3 = "filename3.npy"
data1 = np.load(data_file1)
data2 = np.load(data_file2)
data3 = np.load(data_file3)
data_total = np.concatenate((data1, data2, data3)) # This array is shape (800,800,1300), around 6 GB.
array1 = np.arange(1300) + 1
vector = np.arange(800) + 1
def function_matrix(data_total, vector):
    Multi_matrix = array1[:, None, None] * data_total  # step 1, multiplies each (800, 800) matrix
    Sum_matrix = np.sum(Multi_matrix, axis=0)  # sum matrix
    mTCm = np.array([np.dot(vector.T, np.linalg.solve(Sum_matrix, vector))])
    return mTCm
draw_pointsA = np.asarray([[function_matrix(data_total[i], vector[j]) for i in np.arange(0,100)] for j in np.arange(0,100)])
filename = "save_datapoints.npy"
np.save(filename, draw_pointsA)
EDIT 2:
It is actually 12 GB of RAM; 20.1 GB is the virtual size of the process.
This doesn't answer your question, but proposes a way to avoid the problem from the start.
Step 1 is sequential -- you only need 1 matrix loaded at a time.
Change your code to process each matrix independently.
By Step 2 your memory requirement is down to 800 * 800 * sizeof(datum), which is a few megabytes, and you can certainly afford to keep that in memory.
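A minimal sketch of that idea (names follow the question's code; the key point is to fold the scaling and the sum into one pass so the (1300, 800, 800) intermediate is never materialized):
import numpy as np
def function_matrix(data_total, vector, weights):
    # accumulate weights[k] * data_total[k] one (800, 800) matrix at a time
    sum_matrix = np.zeros(data_total.shape[1:], dtype=np.float64)
    for k in range(data_total.shape[0]):
        sum_matrix += weights[k] * data_total[k]
    return np.dot(vector, np.linalg.solve(sum_matrix, vector))
Here weights plays the role of array1 from the question.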
It sounds like this could be a type issue, i.e. you converted the values in the matrices to a different type. Perhaps you stored the original matrix with values as int16 or single precision floats, and after multiplying by a float it is stored as a matrix of double values (which require twice as much memory).
You can use the dtype argument to set the value type for the matrix.
Other possible reasons could be that some additional matrices are created underway. That's obviously impossible to decode unless you post the code.
A possible solution to your memory problem is to use HDF5 files, and write the matrices to disk. Then you could load the matrix one at a time. This is easy with h5py, as the matrices can be compressed, and/or sliced using numpy/scipy syntax.
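A sketch of that approach with h5py (the file name, dataset layout and the source generator are assumptions, not your code):
import h5py
import numpy as np
# one-off conversion: stream the stack into an HDF5 file, chunked per matrix
with h5py.File('matrices.h5', 'w') as f:
    dset = f.create_dataset('stack', shape=(1300, 800, 800), dtype='float64',
                            chunks=(1, 800, 800), compression='gzip')
    for k, matrix in enumerate(iter_source_matrices()):   # hypothetical generator over your matrices
        dset[k] = matrix
# later: read the matrices back one at a time
with h5py.File('matrices.h5', 'r') as f:
    m = f['stack'][42]   # only this (800, 800) slice is loaded into RAM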
I am trying to import a 1.25 GB dataset into python using dask.array
The file is a 1312*2500*196 array of uint16's. I need to convert this to a float32 array for later processing.
I have managed to stitch together this Dask array in uint16, however when I try to convert to float32 I get a memory error.
It doesn't matter what I do to the chunk size, I will always get a memory error.
I create the array by concatenating sub-arrays of 100 lines each (breaking the 2500 dimension up into little pieces of 100 lines). Since dask can't natively read .RAW imaging files, I have to use numpy.memmap() to read the file and then create the array.
Below I will supply an "as short as possible" code snippet:
I have tried two methods:
1) Create the full uint16 array and then try to convert to float32:
(note: the memmap is a 1312x100x196 array and lines ranges from 0 to 24)
for i in range(lines):
    NewArray = da.concatenate([OldArray, Memmap], axis=0)
    OldArray = NewArray
return NewArray
and then I use
Float32Array = FinalArray.map_blocks(lambda FinalArray: FinalArray * 1.,dtype=np.float32)
In method 2:
for i in range(lines):
    NewArray = da.concatenate([OldArray, np.float32(Memmap)], axis=0)
    OldArray = NewArray
return NewArray
Both methods result in a memory error.
Is there any reason for this?
I read that dask array is capable of doing up to 100 GB dataset calculations.
I tried all chunk sizes (from as small as 10x10x10 to a single line)
You can create a dask.array from a numpy memmap array directly with the da.from_array function
x = load_memmap_numpy_array_from_raw_file(filename)
d = da.from_array(x, chunks=...)
You can change the dtype with the astype method
d = d.astype(np.float32)
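Putting those two steps together (the file name, dtype/shape and the chunk choice below are assumptions; pick chunks that give pieces of a few tens of MB):
import numpy as np
import dask.array as da
# map the raw file directly; dtype and shape must match how the file was written
x = np.memmap('data.raw', dtype=np.uint16, mode='r', shape=(1312, 2500, 196))
d = da.from_array(x, chunks=(1312, 100, 196))   # e.g. 100-line chunks, as in the question
d = d.astype(np.float32)
result = d.mean().compute()                     # evaluation is lazy and runs one chunk at a time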
I have a large scipy sparse matrix, which is taking up >90% of my total system memory. I would like to save it to disk, as it takes hours to build the matrix...
I tried cPickle, but that leads to a major memory explosion...
import numpy as np
from scipy.sparse import lil_matrix
import cPickle
dim = 10**8
M = lil_matrix((dim, dim), dtype=np.float)
with open(filename, 'wb') as f:
    cPickle.dump(M, f)  # leads to a major memory explosion, presumably there is lots of copying
Meanwhile, HDF5 didn't like the datatype: TypeError: Object dtype dtype('O') has no native HDF5 equivalent
So what should I do?
Pickling is very memory inefficient, unfortunately. I would recommend accessing the underlying data array attributes of the sparse matrix, and storing those in an efficient manner, such as hdf5. Reconstructing a sparse matrix from a triplet of row/column/data vectors should be easy.
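For example, after converting to COO you can store the row/column/data triplet plus the shape in HDF5 and rebuild the matrix later; a minimal sketch:
import h5py
from scipy.sparse import coo_matrix
def save_sparse(filename, m):
    coo = m.tocoo()                      # works from lil, csr, dok, ...
    with h5py.File(filename, 'w') as f:
        f.create_dataset('row', data=coo.row)
        f.create_dataset('col', data=coo.col)
        f.create_dataset('data', data=coo.data)
        f.attrs['shape'] = coo.shape
def load_sparse(filename):
    with h5py.File(filename, 'r') as f:
        return coo_matrix((f['data'][:], (f['row'][:], f['col'][:])),
                          shape=tuple(f.attrs['shape']))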
It depends on how much data is actually stored in the matrix. Have you looked at converting the matrix type before serialisation?
The LIL matrix is not the most memory efficient sparse matrix you have available. You could look at converting to either DIA, COO or DOK before pickling.
For example:
In [43]: dim = 10**6
In [44]: M = lil_matrix((dim, dim), dtype=np.float)
In [45]: for ii in range(10000):
    ...:     M[np.random.uniform(0, dim), np.random.uniform(0, dim)] = 1
In [46]: len(cPickle.dumps(M.todok()))
Out[46]: 1256302
In [47]: len(cPickle.dumps(M.tocoo()))
Out[47]: 557691
# compared to
In [48]: len(cPickle.dumps(M))
Out[48]: 23018393
These formats don't all support the same set of operations, but conversion between the formats is trivial.