RAM usage in dealing with numpy arrays and Python lists - python

I have memory issues and can't understand why. I'm using Google Colab, which gives me 12 GB of RAM and lets me monitor how much RAM is being used.
I'm reading np.arrays from files and loading each array into a list.
import glob
import sys
import cv2
import numpy as np

database_list = list()
for filename in glob.glob('*.npy'):
    temp_img = np.load(filename)
    temp_img = temp_img.reshape((-1, 64)).astype('float32')
    temp_img = cv2.resize(temp_img, (64, 3072), interpolation=cv2.INTER_LINEAR)
    database_list.append(temp_img)
The code print("INTER_LINEAR: %d bytes" % (sys.getsizeof(database_list))) prints:
INTER_LINEAR: 124920 bytes
It prints the same value for arrays resized to 64x64, 512x64, 1024x64, 2048x64 and 3072x64. But if I resize these arrays to 4096x64, I get an error because too much RAM is used.
With 3072x64 arrays I can watch the RAM usage climb higher and higher and then drop back down.
My final goal is to zero-pad each array to a size of 8192x64, but my session crashes before I get there; that's another problem, though.
How is the RAM being used? Why does the list report the same size even though the arrays have different dimensions? How is Python loading and manipulating these files in a way that explains this RAM usage pattern?
EDIT:
Does
sizeofelem = database_list[0].nbytes
# all arrays now have the same dimensions MxN, so regardless of their content they should occupy the same amount of memory
total_size = sizeofelem * len(database_list)
work, so that total_size reflects the correct size of the list?

I've got the solution.
First of all, as Dan Mašek pointed out, I was measuring the memory used by the list itself, which is (roughly speaking) just a collection of pointers to the arrays. To measure the real memory usage:
print(database_list[0].nbytes * len(database_list) / 1000000, "MB")
where database_list[0].nbytes is reliable because all the elements of database_list have the same size. To be more precise, I should also add the array metadata and possibly any data linked to it (if, for example, I'm storing other structures in the array).
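For illustration, a minimal sketch (with a small, hypothetical stand-in for database_list) of the difference between the two measurements:
import sys
import numpy as np

# hypothetical stand-in: 5 arrays of 3072x64 float32
database_list = [np.zeros((3072, 64), dtype='float32') for _ in range(5)]

# getsizeof counts only the list object and its pointers, not the arrays' buffers
print("getsizeof:", sys.getsizeof(database_list), "bytes")

# nbytes counts the actual array data
print("data:", database_list[0].nbytes * len(database_list) / 1000000, "MB")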
To reduce the memory footprint, I should take into account the type of the data I'm reading, which here consists of values in the range 0-65535, so:
database_list = list()
for filename in glob.glob('*.npy'):
    temp_img = np.load(filename)
    temp_img = temp_img.reshape((-1, 64)).astype(np.uint16)
    database_list.append(temp_img)
Moreover, if I do some calculations on the data stored in database_list, for example normalizing the values to the range 0-1 with something like database_list = database_list / 65535.0 (note: database_list, being a Python list, does not actually support that operation), I need another explicit cast, because NumPy promotes the result to float64.
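A minimal sketch of that last step, assuming the arrays are first stacked into a single NumPy array (so the division is actually supported) and cast explicitly to float32 instead of letting the result become float64:
import numpy as np

# stack the same-shaped uint16 arrays into one array
database = np.stack(database_list)

# dividing uint16 data by a float would promote to float64;
# an explicit float32 cast keeps the footprint half as large
database = database.astype(np.float32) / np.float32(65535.0)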

Related

Python h5py - efficient access of arrays of ragged arrays

I have a large h5py file with several ragged arrays in a large dataset. The arrays have one of the following types:
# Create types of lists of variable length vectors
vardoub = h5py.special_dtype(vlen=np.dtype('double'))
varint = h5py.special_dtype(vlen=np.dtype('int8'))
Within an HDF5 group (grp), I create datasets of N jagged items, e.g.:
d = grp.create_dataset("predictions", (N,), dtype=vardoub)
and populate d[0], d[1], ..., d[N-1] with long numpy arrays (usually hundreds of millions of elements).
Creating these arrays works well; my issue is related to access. If I want to access a slice of one of the arrays, e.g. d[0][5000:6000] or d[0][[50, 89, 100]], the memory usage goes through the roof and I believe it is reading in large sections of the array; I can watch physical memory usage rise very quickly from 5-6 GB to 32 GB (the size of the machine's RAM). p = d[0] reads the whole array into memory, so I think that is what is happening: the full array is loaded and only then indexed.
Is there a better way to do this? d[n]'s type is a numpy array and I cannot take a reference to it. I suspect that I could restructure the data so that I have groups for each of the indices, e.g. '0/predictions', '1/predictions', ..., but I would prefer not to convert the data if there is a reasonable alternative.
Thank you,
Marie
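For reference, a sketch of the restructuring mentioned above (one regular, fixed-length dataset per index instead of one vlen dataset); with this layout, slicing a dataset reads only the requested range from disk. The file name and the source list list_of_long_arrays are hypothetical:
import h5py

with h5py.File("predictions.h5", "w") as f:
    for i, arr in enumerate(list_of_long_arrays):  # hypothetical source arrays
        # intermediate groups ('0', '1', ...) are created automatically;
        # chunking makes partial reads cheap
        f.create_dataset(f"{i}/predictions", data=arr, chunks=True)

with h5py.File("predictions.h5", "r") as f:
    part = f["0/predictions"][5000:6000]  # reads only ~1000 values, not the whole array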

How to create huge sparse matrix with dtype=float16?

I've tried all of these and got either a memory error or some other kind of error.
from numpy import float16, float_
from scipy.sparse import csc_matrix

Matrix1 = csc_matrix((130000, 130000)).todense()
Matrix1 = csc_matrix((130000, 130000), dtype=float_).todense()
Matrix1 = csc_matrix((130000, 130000), dtype=float16).todense()
How can I create a huge sparse matrix with float type of data?
To create a huge sparse matrix, just do exactly what you're doing:
Matrix1 = csc_matrix((130000,130000), dtype=float16)
… without calling todense() at the end. This succeeds, and takes a tiny amount of memory.[1]
When you add todense(), that successfully creates a huge sparse array that takes a tiny amount of memory, and then tries to convert that to a dense array that takes a huge amount of memory, which fails with a MemoryError. But the solution to that is just… don't do that.
And likewise, if you use dtype=float_ instead of dtype=float16, you get float64 values (which aren't what you want, and take 4x the memory), but again, the solution is just… don't do that.
[1] sys.getsizeof(m) gives 56 bytes for the sparse array handle, sys.getsizeof(m.data) gives 96 bytes for the internal storage handle, and m.data.nbytes gives 0 bytes for the actual storage, for a grand total of 152 bytes, which is unlikely to raise a MemoryError.
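As a minimal sketch of actually working with such a matrix while keeping it sparse (building it in LIL format, which is cheap to assign into, then converting to CSC for arithmetic):
import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix((130000, 130000), dtype=np.float16)
m[0, 1] = 3.5          # assigning individual entries is efficient in LIL
m[42, 99999] = 1.25

m = m.tocsc()          # convert to CSC for fast arithmetic / column slicing
print(m.nnz)           # 2 stored non-zeros
print(m.data.nbytes)   # only the stored non-zeros take memory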

Most orthodox way of splitting matrix and list in Python

If I have ratios to split a dataset into training, validation, and test sets, what is the most orthodox and elegant way of doing this in Python?
For instance, I split my data into 60% training, 20% testing, and 20% validation. I have 1000 rows of data with 10 features each, and a label vector of size 1000. The training set matrix should be of size (600, 10), and so on.
If I create new matrices of features and lists of labels, it wouldn't be memory efficient, right? Let's say I did something like this:
TRAIN_PORTION = int(datasetSize * tr)
VALIDATION_PORTION = int(datasetSize * va)
# Whatever is left will be for testing
TEST_PORTION = datasetSize - TRAIN_PORTION - VALIDATION_PORTION

trainingSet = dataSet[0:TRAIN_PORTION]
validationSet = dataSet[TRAIN_PORTION:TRAIN_PORTION + VALIDATION_PORTION]
testSet = dataSet[TRAIN_PORTION + VALIDATION_PORTION:datasetSize]
That would leave me with double the amount of memory used, right?
Sorry for the incorrect Python syntax, and thank you for any help.
That's correct: you will double the memory usage that way. To avoid doubling the memory usage, you need to do one of two things:
1. Release the memory from one sub-matrix before you create the next; this reduces your memory high-water mark to 1.6x the main matrix.
2. Write your processing routines to stop at the proper row, always working on the original matrix.
You can achieve the first one by passing list slices to your processing routines, such as
model_test(data_set[:TRAIN_PORTION])
Remember, when you refer to a slice, the interpreter will build a temporary object that results from the given limits.
RESPONSE TO OP COMMENT
The reference I gave you does create a new list. To avoid using more memory, pass the entire list and the desired limits, such as
process_function(data_set, 0, TRAIN_PORTION)
process_function(data_set, TRAIN_PORTION,
                 TRAIN_PORTION + VALIDATION_PORTION)
process_function(data_set,
                 TRAIN_PORTION + VALIDATION_PORTION,
                 len(data_set))
If you want to do this with just list slices, then please explain where you're having trouble, and why the various pieces of documentation and the tutorials aren't satisfying your needs.
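For illustration, a hypothetical process_function along those lines might simply iterate between the given limits, so no slice (and therefore no copy) is ever created:
def process_function(data_set, start, stop):
    # Hypothetical example: operate on rows start..stop-1 of data_set
    # without slicing, so no temporary copy of those rows is made.
    total = 0.0
    for i in range(start, stop):
        total += sum(data_set[i])  # data_set[i] is one row, e.g. a list of 10 features
    return total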
If you use numpy arrays (your code actually looks like you do), it's possible to use views (memory is shared). It's not always easy to tell which operation returns a view and which does not; here are some hints.
Short example:
import numpy as np
a = np.random.normal(size=(1000, 10))
b = a[:600]
print(b.flags['OWNDATA'])
# False
print(b[3,2])
# 0.373994992467 (some random-val)
a[3,2] = 88888888
print(b[3,2])
# 88888888.0
print(a.shape)
# (1000, 10)
print(b.shape)
# (600, 10)
This should allow you to do some in-place shuffling at the beginning and then use contiguous segments of your data as views for train, val, and test.
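A minimal sketch of that idea, using the shapes from the question (1000 rows, 10 features, a label vector) and the classic shared-RNG-state trick to shuffle features and labels consistently in place:
import numpy as np

data = np.random.normal(size=(1000, 10))
labels = np.random.randint(0, 2, size=1000)

# in-place shuffle; restoring the RNG state applies the same permutation to the labels
state = np.random.get_state()
np.random.shuffle(data)
np.random.set_state(state)
np.random.shuffle(labels)

train, val, test = data[:600], data[600:800], data[800:]
print(train.flags['OWNDATA'], val.flags['OWNDATA'], test.flags['OWNDATA'])
# False False False -> all three are views sharing memory with data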

Python Numpy integers high memory use

I'm trying to store 25 million integers efficiently in Python. Preferably using Numpy so I can perform operations on them afterwards.
The idea is to store them as 4-byte unsigned integers, so the required memory should be 25M entries * 4 bytes ≈ 95 MB.
I have written the following test code and the reported memory consumption is almost 700 MB, why?
import numpy as np

a = np.array([], dtype=np.uint32)
# TEST MEMORY FOR PUTTING 25 MILLION INTEGERS IN MEMORY
for i in range(0, 25000000):
    np.append(a, np.asarray([i], dtype=np.uint32))
If I do this for example, it works as expected:
a = np.random.random_integers(1, 25000000, size=25000000)
Why?
Actually the problem is range(0, 25000000), because (on Python 2) this creates a list of int objects.
The memory needed to hold such a list (assuming, for simplicity, 32 bytes per integer) is 25,000,000 * 32 B = 800,000,000 B ≈ 762 MiB.
Use the generator-like xrange, or switch to Python 3, where range is far less memory-hungry (the values are not precomputed but produced on demand).
The actual numpy array stays empty the whole time: np.append returns a new array, and since its result is never assigned back to a, it is discarded immediately, so its memory use is negligible.
I would work with your a = np.random.random_integers(1, 25000000, size=25000000) and convert it (if you want) to np.uint32 afterwards.
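A quick sketch of the memory check once the array is built in one go (either via np.arange or the random_integers call above):
import numpy as np

a = np.arange(25000000, dtype=np.uint32)  # built directly: no Python int list, no np.append copies
print(a.nbytes / 1e6, "MB")               # 100.0 MB of actual data (~95 MiB), as expected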

Big Satellite Image Processing

I'm trying to run Mort Canty's (http://mcanty.homepage.t-online.de/) Python iMAD implementation on bitemporal RapidEye multispectral images. It basically calculates the canonical correlation of the two images and then subtracts them. The problem I'm having is that the images are 5000 x 5000 x 5 (bands) pixels. If I try to run this on the whole image I get a memory error.
Would the use of something like pyTables help me with this?
What Mort Canty's code does is load the images using gdal and then store them in a 10 x 25,000,000 array.
# initial weights
wt = ones(cols*rows)
# data array (transposed so observations are columns)
dm = zeros((2*bands, cols*rows))
k = 0
for b in pos:
    band1 = inDataset1.GetRasterBand(b+1)
    band1 = band1.ReadAsArray(x0, y0, cols, rows).astype(float)
    dm[k, :] = ravel(band1)
    band2 = inDataset2.GetRasterBand(b+1)
    band2 = band2.ReadAsArray(x0, y0, cols, rows).astype(float)
    dm[bands+k, :] = ravel(band2)
    k += 1
Even just creating a 10 x 25,000,000 numpy array of floats throws a memory error. Anyone have a good idea of how to get around this? This is my first post ever so any advice on how to post would also be welcome.
Greetings
numpy uses float64 by default, so your dm array takes up 2 GB of memory (8 * 10 * 25,000,000 bytes), and the other arrays probably about 200 MB (~8 * 5000 * 5000 bytes) each.
astype(float) returns a new array, so you need memory for that as well; it's probably not even needed, since the type is converted implicitly when the data is copied into the result array.
When the memory used inside the for loop is freed depends on garbage collection, and this doesn't even consider the memory overhead of GetRasterBand and ReadAsArray.
Are you sure your input data uses 64-bit floats? If it uses 32-bit floats, you could easily halve the memory usage by specifying dtype='f' for your arrays.
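A sketch of what the loading loop from the question could look like with 32-bit floats, reusing the variables from the snippet above and assuming float32 precision is acceptable for the rasters; the assignment into dm converts each band's dtype implicitly, so the explicit astype(float) temporaries disappear:
import numpy as np

# half the footprint of the float64 version: ~1 GB instead of ~2 GB for dm
wt = np.ones(cols * rows, dtype=np.float32)
dm = np.zeros((2 * bands, cols * rows), dtype=np.float32)

for k, b in enumerate(pos):
    band1 = inDataset1.GetRasterBand(b + 1)
    dm[k, :] = band1.ReadAsArray(x0, y0, cols, rows).ravel()          # implicit cast on assignment
    band2 = inDataset2.GetRasterBand(b + 1)
    dm[bands + k, :] = band2.ReadAsArray(x0, y0, cols, rows).ravel()  # implicit cast on assignment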
