I'm trying to run Mort Canty's Python iMAD implementation (http://mcanty.homepage.t-online.de/) on bitemporal RapidEye multispectral images. It basically calculates the canonical correlation for the two images and then subtracts them. The problem I'm having is that the images are 5000 x 5000 x 5 (bands) pixels; if I try to run this on the whole image I get a memory error.
Would the use of something like pyTables help me with this?
What Mort Canty's code does is load the images using GDAL and then store them in a 10 x 25,000,000 array:
# initial weights
wt = ones(cols*rows)
# data array (transposed so observations are columns)
dm = zeros((2*bands, cols*rows))
k = 0
for b in pos:
    band1 = inDataset1.GetRasterBand(b+1)
    band1 = band1.ReadAsArray(x0, y0, cols, rows).astype(float)
    dm[k, :] = ravel(band1)
    band2 = inDataset2.GetRasterBand(b+1)
    band2 = band2.ReadAsArray(x0, y0, cols, rows).astype(float)
    dm[bands+k, :] = ravel(band2)
    k += 1
Even just creating a 10 x 25,000,000 numpy array of floats throws a memory error. Does anyone have a good idea of how to get around this? This is my first post ever, so any advice on how to post would also be welcome.
Greetings
numpy uses float64 by default, so your dm array takes up 2 GB of memory (8 * 10 * 25,000,000 bytes), and the other arrays probably take about 200 MB (~8 * 5000 * 5000 bytes) each.
astype(float) returns a new array, so you need memory for that as well - and it is probably not even needed, since the type is converted implicitly when the data is copied into the result array.
When the memory used in the for-loop is freed depends on garbage collection, and this doesn't account for the memory overhead of GetRasterBand and ReadAsArray.
Are you sure your input data uses 64-bit floats? If it uses 32-bit floats, you could easily halve the memory usage by specifying dtype='f' on your arrays.
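For example, a minimal sketch of the single-precision allocation (cols, rows and bands as in the question; the savings assume the rest of the pipeline also stays in float32):
import numpy as np

bands, cols, rows = 5, 5000, 5000

# float32 data matrix: 4 * 10 * 25,000,000 bytes ~= 1 GB instead of ~2 GB
dm = np.zeros((2 * bands, cols * rows), dtype='f')
print(dm.nbytes / 1e9, "GB")   # 1.0

# when filling it, cast each band to float32 as well, e.g. .astype('f')
# instead of .astype(float), so no float64 temporaries are created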
I have memory issues and can't understand why. I'm using Google Colab, which gives me 12 GB of RAM and lets me watch the RAM usage.
I'm reading np.arrays from files and loading each array into a list.
import glob

import cv2
import numpy as np

database_list = list()
for filename in glob.glob('*.npy'):
    temp_img = np.load(filename)
    temp_img = temp_img.reshape((-1, 64)).astype('float32')
    temp_img = cv2.resize(temp_img, (64, 3072), interpolation=cv2.INTER_LINEAR)
    database_list.append(temp_img)
The code print("INTER_LINEAR: %d bytes" % (sys.getsizeof(database_list))) prints:
INTER_LINEAR: 124920 bytes
It is the same value for arrays reshaped to 64x64, 512x64, 1024x64, 2048x64 and 3072x64. But if I reshape these arrays to 4096x64, I get an error because too much RAM is used.
With 3072x64 arrays I can see the RAM usage climb higher and higher and then drop back down.
My final goal is to zero-pad each array to a dimension of 8192x64, but my session crashes before that; that is another problem, though.
How is the RAM being used? Why does the list report the same size when the arrays have different dimensions? How is Python loading and manipulating these files, and what explains the RAM usage history?
EDIT:
Does
sizeofelem = database_list[0].nbytes
# all arrays now have the same dimensions MxN, so regardless of their content they should occupy the same memory
total_size = sizeofelem * len(database_list)
work, and does total_size reflect the correct size of the list's contents?
I've got the solution.
First of all, as Dan Mašek pointed out, I'm measuring the memory used by the list, which is (roughly speaking) a collection of pointers to the arrays. To measure the real memory usage:
print(database_list[0].nbytes * len(database_list) / 1000000, "MB")
where database_list[0].nbytes is reliable because all the elements of database_list have the same size. To be more precise, I should also add the array metadata and possibly any data linked to it (if, for example, I'm storing other structures inside the array).
To reduce the impact on memory, I should know the type of data that I'm reading; in this case the values are in the range 0-65535, so:
database_list = list()
for filename in glob.glob('*.npy'):
    temp_img = np.load(filename)
    temp_img = temp_img.reshape((-1, 64)).astype(np.uint16)
    database_list.append(temp_img)
Moreover, if I do some calculations on the data stored in database_list, for example normalising the values to the range 0-1 with something like database_list = database_list / 65535.0 (NB: database_list, being a list, does not support that operation directly), I should do another explicit cast, because otherwise NumPy promotes the result to something like float64.
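For example, a minimal sketch of keeping the normalisation in single precision (this assumes the uint16 arrays built above have been stacked into one array first):
import numpy as np

# stack the list of equally-shaped uint16 arrays into a single array
database = np.stack(database_list)

# dividing by a Python float would promote to float64; cast explicitly instead
database = database.astype(np.float32) / np.float32(65535.0)

print(database.dtype, database.nbytes / 1000000, "MB")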
I've tried all of the following and got either a MemoryError or some other kind of error.
from numpy import float_, float16
from scipy.sparse import csc_matrix

Matrix1 = csc_matrix((130000, 130000)).todense()
Matrix1 = csc_matrix((130000, 130000), dtype=float_).todense()
Matrix1 = csc_matrix((130000, 130000), dtype=float16).todense()
How can I create a huge sparse matrix with float type of data?
To create a huge sparse matrix, just do exactly what you're doing:
Matrix1 = csc_matrix((130000,130000), dtype=float16)
… without calling todense() at the end. This succeeds, and takes a tiny amount of memory.1
When you add todense(), that successfully creates a huge sparse array that takes a tiny amount of memory, and then tries to convert that to a dense array that takes a huge amount of memory, which fails with a MemoryError. But the solution to that is just… don't do that.
And likewise, if you use dtype=float_ instead of dtype=float16, you get float64 values (which aren't what you want, and take 4x the memory), but again, the solution is just… don't do that.
1. sys.getsizeof(m) gives 56 bytes for the sparse array handle, sys.getsizeof(m.data) gives 96 bytes for the internal storage handle, and m.data.nbytes gives 0 bytes for the actual storage, for a grand total of 152 bytes. Which is unlikely to raise a MemoryError.
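As a minimal sketch (the exact byte counts can differ between Python/SciPy versions), you can build the matrix and check its footprint without ever calling todense():
import sys

import numpy as np
from scipy.sparse import csc_matrix

m = csc_matrix((130000, 130000), dtype=np.float16)

print(sys.getsizeof(m))   # handle object: a few dozen bytes
print(m.data.nbytes)      # 0 bytes of actual storage, since there are no nonzeros yet
print(m.shape, m.nnz)     # (130000, 130000) 0

# operations that stay sparse are cheap, e.g. scaling by a constant
m2 = m * 2.0
print(m2.nnz)             # still 0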
If I have ratios to split a dataset into training, validation, and test sets, what is the most orthodox and elegant way of doing this in Python?
For instance, I split my data into 60% training, 20% testing, and 20% validation. I have 1000 rows of data with 10 features each, and a label vector of size 1000. The training set matrix should be of size (600, 10), and so on.
If I create new matrices of features and new lists of labels, it wouldn't be memory efficient, right? Let's say I did something like this:
TRAIN_PORTION = int(datasetSize * tr)
VALIDATION_PORTION = int(datasetSize * va)
# Whatever is left will be for testing
TEST_PORTION = datasetSize - TRAIN_PORTION - VALIDATION_PORTION
trainingSet = dataSet[:TRAIN_PORTION]
validationSet = dataSet[TRAIN_PORTION:TRAIN_PORTION + VALIDATION_PORTION]
testSet = dataSet[TRAIN_PORTION + VALIDATION_PORTION:]
That would leave me with the double amount of used memory, right?
Apologies if the Python syntax isn't exact, and thank you for any help.
That's correct: you will double the memory usage that way. To avoid doubling the memory usage, you need to do one of two things:
Release the memory from one sub-matrix before you create the next; this reduces your memory high-water mark to 1.6x the main matrix;
Write your processing routines to stop at the proper row, always working on the original matrix.
You can achieve the first one by passing list slices to your processing routines, such as
model_test(data_set[:TRAIN_PORTION])
Remember that when you take a slice, the interpreter builds a temporary object containing the elements between the given limits.
RESPONSE TO OP COMMENT
The reference I gave you does create a new list. To avoid using more memory, pass the entire list and the desired limits, such as
process_function(data_set, 0, TRAIN_PORTION)
process_function(data_set, TRAIN_PORTION,
TRAIN_PORTION + VALIDATION_PORTION)
process_function(data_set,
TRAIN_PORTION + VALIDATION_PORTION,
len(data_set))
If you want to do this with just list slices, then please explain where you're having trouble, and why the various pieces of documentation and the tutorials aren't satisfying your needs.
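As a minimal sketch of the index-based approach (process_function and the per-row work inside it are hypothetical, purely to illustrate passing limits instead of slices; data_set is the full list from above):
def process_function(data_set, start, end):
    # operate on rows start..end-1 of data_set without copying them
    for i in range(start, end):
        row = data_set[i]
        # ... per-row work goes here ...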
If you use numpy arrays (your code actually looks like it does), it's possible to use views (memory is shared). It's not always easy to tell which operations result in a view and which do not; here are some hints.
Short example:
import numpy as np
a = np.random.normal(size=(1000, 10))
b = a[:600]
print(b.flags['OWNDATA'])
# False
print(b[3,2])
# 0.373994992467 (some random-val)
a[3,2] = 88888888
print(b[3,2])
# 88888888.0
print(a.shape)
# (1000, 10)
print(b.shape)
# (600, 10)
This will probably allow you to do an in-place shuffle at the beginning and then use linear segments of your data as views for train, validation and test.
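A minimal sketch of that idea, assuming features X of shape (1000, 10) as in the example above (the labels would be shuffled with the same permutation):
import numpy as np

X = np.random.normal(size=(1000, 10))

# shuffle the rows in place
np.random.shuffle(X)

# contiguous slices are views: they share memory with X
X_train = X[:600]
X_val = X[600:800]
X_test = X[800:]

print(X_train.flags['OWNDATA'])   # False -> a view, no extra copy
X[0, 0] = 123.0
print(X_train[0, 0])              # 123.0, the view sees the change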
I'm trying to store 25 million integers efficiently in Python. Preferably using Numpy so I can perform operations on them afterwards.
The idea is to store them as 4-byte unsigned integers, so the required memory should be 25M entries * 4 bytes ≈ 95 MB.
I have written the following test code and the reported memory consumption is almost 700 MB, why?
import numpy as np

a = np.array([], dtype=np.uint32)

# TEST MEMORY FOR PUTTING 25 MILLION INTEGERS IN MEMORY
for i in range(0, 25000000):
    np.append(a, np.asarray([i], dtype=np.uint32))
If I do this for example, it works as expected:
a = np.random.random_integers(1, 25000000, size=25000000)
Why?
Actually the problem is range(0, 25000000), because (in Python 2) this creates a list of int objects.
The memory needed to hold such a list is (assuming, for simplicity, 32 bytes per integer) 25,000,000 * 32 B = 800,000,000 B ≈ 762 MiB.
Use the generator-like xrange, or update to Python 3, where range is far less memory-expensive (the values are not precomputed but evaluated when needed).
The actual numpy array is always empty (you only create a temporary copy with np.append - the result of the append is not stored and is therefore discarded right afterwards!) and is therefore negligible.
I would work with your a = np.random.random_integers(1, 25000000, size=25000000) and convert it (if you want) to np.uint32 afterwards.
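A minimal sketch of the memory check for both routes (np.arange stands in for any direct construction; np.random.randint is the modern replacement for random_integers and can produce uint32 directly):
import numpy as np

# build the 25 million integers directly as a uint32 array
a = np.arange(25000000, dtype=np.uint32)
print(a.nbytes / 1e6, "MB")   # 100.0 MB (~95 MiB), as expected for 4-byte ints

# random values work the same way
b = np.random.randint(1, 25000001, size=25000000, dtype=np.uint32)
print(b.nbytes / 1e6, "MB")   # 100.0 MB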
I have very long arrays and tables of time-value pairs in pytables. I need to be able to perform linear interpolation and zero order hold interpolation on this data.
Currently, I'm turning the columns into numpy arrays using pytables' column-wise slice notation and then feeding the numpy arrays to scipy.interpolate.interp1d to create the interpolation functions.
Is there a better way to do this?
The reason I ask is that it is my understanding that turning the columns into numpy arrays basically copies them into memory, which means that when I start running my code at full throttle I'm going to be in trouble, since I will be working with data sets large enough to drown my desktop. Please correct me if I'm mistaken on this point.
Also, due to the large amounts of data I'll be working with, I suspect that writing a function that iterates over the pytables arrays/tables in order to do the interpolation myself will be incredibly slow since I need to call the interpolation function many, many times (about as many times as there are records in the data I'm trying to interpolate).
Your question is difficult to answer because there is always a trade-off between memory and computation time, and you are essentially asking not to have to sacrifice either of them, which is impossible. scipy.interpolate.interp1d() requires that the arrays be in memory, and writing an out-of-core interpolator requires that you query the disk linearly in the number of times you call it.
That said, there are a couple of things that you can do, none of which are perfect.
The first thing you can try is downsampling the data. This will cut the amount of data you need to hold in memory by the factor that you downsample by. The disadvantage is that your interpolation is that much coarser. Luckily this is pretty easy to do: just provide a step size when you access the columns. For a downsampling factor of 4 you would do:
import scipy.interpolate
import tables as tb

with tb.open_file('myfile.h5', 'r') as f:
    x = f.root.mytable.cols.x[::4]
    y = f.root.mytable.cols.y[::4]

interp = scipy.interpolate.interp1d(x, y)
ynew = interp(xnew)
You could make this step size adjustable based on the memory available if you wanted to as well.
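For instance, a rough sketch of deriving the step from a memory budget (the budget value and the assumption of two float64 columns are mine):
import math

import tables as tb

# hypothetical budget: how much RAM to spend on the two columns
MEM_BUDGET_BYTES = 500 * 1024**2

with tb.open_file('myfile.h5', 'r') as f:
    nrows = f.root.mytable.nrows
    # two float64 columns (x and y) -> 16 bytes per kept row
    step = max(1, math.ceil(nrows * 16 / MEM_BUDGET_BYTES))
    x = f.root.mytable.cols.x[::step]
    y = f.root.mytable.cols.y[::step]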
Alternatively, if the data set that you are interpolating values for - xnew - exists only on a subset of the original domain, you can get away with reading in only portions of the original table that are in the new neighborhood. Given a fudge factor of 10%, you would do something like the following:
query = "({0} <= x) & (x <= {1})".format(xnew.min()*0.9, xnew.max()*1.1)

with tb.open_file('myfile.h5', 'r') as f:
    data = f.root.mytable.read_where(query)

interp = scipy.interpolate.interp1d(data['x'], data['y'])
ynew = interp(xnew)
Extending this idea, if we have the case where xnew is sorted (monotonically increasing) but does extend over the entire original domain, then you can read from the table on disk in a chunked fashion. Say we want 10 chunks:
import numpy as np

newlen = len(xnew)
chunks = 10
chunklen = newlen // chunks

ynew = np.empty(newlen, dtype=float)
for i in range(chunks):
    xnew_chunk = xnew[i*chunklen:(i+1)*chunklen]
    query = "({0} <= x) & (x <= {1})".format(xnew_chunk.min()*0.9,
                                             xnew_chunk.max()*1.1)
    with tb.open_file('myfile.h5', 'r') as f:
        data = f.root.mytable.read_where(query)

    interp = scipy.interpolate.interp1d(data['x'], data['y'])
    ynew[i*chunklen:(i+1)*chunklen] = interp(xnew_chunk)
Striking the balance between memory and I/O speed is always a challenge. There are probably things that you can do to speed these strategies up depending on how regular your data is. Still, this should be enough to get you started.