Python Numpy integers high memory use

Python Numpy integers high memory use - python

I'm trying to store 25 million integers efficiently in Python. Preferably using Numpy so I can perform operations on them afterwards.
The idea is to have them stored as 4-bytes unsigned integers. So the required memory should be 25M entries * 4 bytes = 95 MB.
I have written the following test code and the reported memory consumption is almost 700 MB, why?
a = np.array([], dtype=np.uint32)
# TEST MEMORY FOR PUTTING 25 MILLION INTEGERS IN MEMORY
for i in range(0, 25000000):
np.append(a, np.asarray([i], dtype=np.uint32))
If I do this for example, it works as expected:
a = np.random.random_integers(1, 25000000, size=25000000)
Why?

Actually the problem is range(0, 25000000) because this creates a list composed of ints.
The memory needed for containing such a list is (for simplicity assuming 32bytes per integer) 25kk * 32B = 800kkB ~ 762MB.
Use the generator-like xrange or update to python3 there range is a bit less memory expensive (because there the values are not precomputed but evaluated when needed).
The actual numpy array is always empty (since you only create a temporary copy with np.append - the result of the append is not stored and therefore discarded directly afterwards!) and therefore negligable.
I would work with your a = np.random.random_integers(1, 25000000, size=25000000) and convert it (if you want) to np.uint afterwards.

Related

memory usage: numpy-arrays vs python-lists

Numpy is known for optimized arrays and various advantages over python-lists.
But when I check for the memory usage python-lists have less space than the numpy arrays.
The code I used is entered below.
Can anyone explain me why?
import sys
Z = np.zeros((10,10),dtype = int)
A = [[0] * 10] * 10
print(A,'\n',f'{sys.getsizeof(A)} bytes')
print(Z,'\n',f'{Z.size * Z.itemsize} bytes')

You're not measuring correctly; the native Python list only contains 10 references. You need to add in the collective size of the sub-lists as well:
>>> sys.getsizeof(A) + sum(map(sys.getsizeof, A))
1496
And it might get worse: each element inside the sub-lists could also be a reference (to an int). It's difficult to check whether the Python implementation is optimizing this away and storing the actual numbers inside the list.
You're also under-representing the size of the numpy array, because it includes a header:
>>> Z.size * Z.itemsize
800
>>> sys.getsizeof(Z)
912
In either case it's not an exact science and will depend on your platform and Python implementation.

According to the spec (https://docs.python.org/3/library/sys.html#sys.getsizeof), "only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."
Also "getsizeof calls object’s " sizeof method
So you are given the size of just the container (a list object).
Please check https://code.activestate.com/recipes/577504/ for a complete size computation, which returns 296bytes for your example since only two unique objects are used. A list [0 0 0 0 0 0 0 0 0 0] and int 0.
If you initialize the list with different values, overall size will increase and will become bigger than np.array, which reserves 4bytes for numpy.int32 type elements, plus the size of its own internal data structure.
Find detailed info with examples here: https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html

RAM usage in dealing with numpy arrays and Python lists

I have memory issues and can't understand why. I'm using Google Colab, that gives me 12GB of RAM and let me see how the RAM usage is.
I'm reading np.array from files, and loading each array in a list.
database_list = list()
for filename in glob.glob('*.npy'):
temp_img = np.load(filename)
temp_img = temp_img.reshape((-1, 64)).astype('float32')
temp_img = cv2.resize(temp_img, (64, 3072), interpolation=cv2.INTER_LINEAR)
database_list.append(temp_img)
The code print("INTER_LINEAR: %d bytes" % (sys.getsizeof(database_list))) prints:
INTER_LINEAR: 124920 bytes
It is the same value for arrays reshaped as 64x64, 512x64, 1024x64, 2048x64 and for 3072x64. But if I reshape these arrays as 4096x64, I get an error, for too much RAM used.
With arrays of 3072x64 I can see the RAM usage get higher and higher and then going back down.
My final goal is to zero-padding each array to a dimension of 8192x64, but my session crash before; but this is another problem.
How is the RAM used? Why, if the arrays have different dimensions, the list has the same size? How python is loading and manipulating this file, that explains the RAM usage history?
EDIT:
Does then
sizeofelem = database_list[0].nbytes
#all arrays have now the same dimensions MxN, so despite its content, they should occupy the same memory
total_size = sizeofelem * len(database_list)
work and total_sizereflects the correct size of the list?

I've got the solution.
First of all, as Dan Mašek pointed out, I'm measuring the memory used by the array, which is a collection of pointers (roughly said). To measure the real memory usage:
(database_list[0].nbytes * len(database_list) / 1000000, "MB")
where database_list[0].nbytes is reliable as all the elements in database_list have the same size. To be more precise, I should add the array metadata and eventually all data linked to it (if, for example, I'm storing in the array other structures).
To have less impact on memory, I should know the type of data that I'm reading, that is values in range 0-65535, so:
database_list = list()
for filename in glob.glob('*.npy'):
temp_img = np.load(filename)
temp_img = temp_img.reshape((-1, 64)).astype(np.uint16)
database_list.append(temp_img)
Moreover, if I do some calculations on the data stored in database_list, for example, normalization of values in the range 0-1 like database_list = database_list/ 65535.0 (NB: database_list, as a list, does not support that operation), I should do another cast, because Python cast the type to something like float64.

Memory efficient sort of massive numpy array in Python

I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats, dimensions = (868940742, 3) which takes up about 20GB of memory on my machine once loaded and just sitting there. I have an early 2015 13' MacBook Pro with 16GB of RAM, 500GB solid state HD and an 3.1 GHz intel i7 processor. Just loading the array overflows to virtual memory but not to the point where my machine suffers or I have to stop everything else I am doing.
I build this VERY large array step by step from 22 smaller (N, 2) subarrays.
Function FUN_1 generates 2 new (N, 1) arrays using each of the 22 subarrays which I call sub_arr.
The first output of FUN_1 is generated by interpolating values from sub_arr[:,0] on array b = array([X, F(X)]) and the second output is generated by placing sub_arr[:, 0] into bins using array r = array([X, BIN(X)]). I call these outputs b_arr and rate_arr, respectively. The function returns a 3-tuple of (N, 1) arrays:
import numpy as np
def FUN_1(sub_arr):
"""interpolate b values and rates based on position in sub_arr"""
b = np.load(bfile)
r = np.load(rfile)
b_arr = np.interp(sub_arr[:,0], b[:,0], b[:,1])
rate_arr = np.searchsorted(r[:,0], sub_arr[:,0]) # HUGE efficiency gain over np.digitize...
return r[rate_r, 1], b_arr, sub_arr[:,1]
I call the function 22 times in a for-loop and fill a pre-allocated array of zeros full_arr = numpy.zeros([868940742, 3]) with the values:
full_arr[:,0], full_arr[:,1], full_arr[:,2] = FUN_1
In terms of saving memory at this step, I think this is the best I can do, but I'm open to suggestions. Either way, I don't run into problems up through this point and it only takes about 2 minutes.
Here is the sorting routine (there are two consecutive sorts)
for idx in range(2):
sort_idx = numpy.argsort(full_arr[:,idx])
full_arr = full_arr[sort_idx]
# ...
# <additional processing, return small (1000, 3) array of stats>
Now this sort had been working, albeit slowly (takes about 10 minutes). However, I recently started using a larger, more fine resolution table of [X, F(X)] values for the interpolation step above in FUN_1 that returns b_arr and now the SORT really slows down, although everything else remains the same.
Interestingly, I am not even sorting on the interpolated values at the step where the sort is now lagging. Here are some snippets of the different interpolation files - the smaller one is about 30% smaller in each case and far more uniform in terms of values in the second column; the slower one has a higher resolution and many more unique values, so the results of interpolation are likely more unique, but I'm not sure if this should have any kind of effect...?
bigger, slower file:
17399307 99.4
17493652 98.8
17570460 98.2
17575180 97.6
17577127 97
17578255 96.4
17580576 95.8
17583028 95.2
17583699 94.6
17584172 94
smaller, more uniform regular file:
1 24
1001 24
2001 24
3001 24
4001 24
5001 24
6001 24
7001 24
I'm not sure what could be causing this issue and I would be interested in any suggestions or just general input about sorting in this type of memory limiting case!

At the moment each call to np.argsort is generating a (868940742, 1) array of int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to sort the columns of full_arr you are generating another (868940742, 1) array of floats, since fancy indexing always returns a copy rather than a view.
One fairly obvious improvement would be to sort full_arr in place using its .sort() method. Unfortunately, .sort() does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an inplace sort over one of the three columns by getting a view onto your array as a structured array with three float fields, then sorting by one of these fields:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
In this case I'm sorting full_arr in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns ('f8') - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. full_arr.flags.C_CONTIGUOUS == True.
Credit for this method should go to Joe Kington for his answer here.
Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using np.argsort to generate an index array, as you mentioned in the comments below (see this previous question). If you use np.argsort to obtain a set of indices to sort by, you might see a modest performance gain by using np.take rather than direct indexing to get the sorted array:
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop
However I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.
Regarding your question about why sorting the second array is faster - yes, you should expect any reasonable sorting algorithm to be faster when there are fewer unique values in the array because on average there's less work for it to do. Suppose I have a random sequence of digits between 1 and 10:
5 1 4 8 10 2 6 9 7 3
There are 10! = 3628800 possible ways to arrange these digits, but only one in which they are in ascending order. Now suppose there are just 5 unique digits:
4 4 3 2 3 1 2 5 1 5
Now there are 2⁵ = 32 ways to arrange these digits in ascending order, since I could swap any pair of identical digits in the sorted vector without breaking the ordering.
By default, np.ndarray.sort() uses Quicksort. The qsort variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.
For example:
%%timeit -n 1 -r 100 x = np.random.random_integers(0, 10, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop
%%timeit -n 1 -r 100 x = np.random.random_integers(0, 1000, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop
In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size compared with the larger array then the cost of copying it due to the fancy indexing will also be smaller.

EDIT: In case anyone new to programming and numpy comes across this post, I want to point out the importance of considering the np.dtype that you are using. In my case, I was actually able to get away with using half-precision floating point, i.e. np.float16, which reduced a 20GB object in memory to 5GB and made sorting much more manageable. The default used by numpy is np.float64, which is a lot of precision that you may not need. Check out the doc here, which describes the capacity of the different data types. Thanks to #ali_m for pointing this out in the comments.
I did a bad job explaining this question but I have discovered some helpful workarounds that I think would be useful to share for anyone who needs to sort a truly massive numpy array.
I am building a very large numpy array from 22 "sub-arrays" of human genome data containing the elements [position, value]. Ultimately, the final array must be numerically sorted "in place" based on the values in a particular column and without shuffling the values within rows.
The sub-array dimensions follow the form:
arr1.shape = (N1, 2)
...
arr22.shape = (N22, 2)
sum([N1..N2]) = 868940742 i.e. there are close to 1BN positions to sort.
First I process the 22 sub-arrays with the function process_sub_arrs, which returns a 3-tuple of 1D arrays the same length as the input. I stack the 1D arrays into a new (N, 3) array and insert them into an np.zeros array initialized for the full dataset:
full_arr = np.zeros([868940742, 3])
i, j = 0, 0
for arr in list(arr1..arr22):
# indices (i, j) incremented at each loop based on sub-array size
j += len(arr)
full_arr[i:j, :] = np.column_stack( process_sub_arrs(arr) )
i = j
return full_arr
EDIT: Since I realized my dataset could be represented with half-precision floats, I now initialize full_arr as follows: full_arr = np.zeros([868940742, 3], dtype=np.float16), which is only 1/4 the size and much easier to sort.
Result is a massive 20GB array:
full_arr.nbytes = 20854577808
As #ali_m pointed out in his detailed post, my earlier routine was inefficient:
sort_idx = np.argsort(full_arr[:,idx])
full_arr = full_arr[sort_idx]
the array sort_idx, which is 33% the size of full_arr, hangs around and wastes memory after sorting full_arr. This sort supposedly generates a copy of full_arr due to "fancy" indexing, potentially pushing memory use to 233% of what is already used to hold the massive array! This is the slow step, lasting about ten minutes and relying heavily on virtual memory.
I'm not sure the "fancy" sort makes a persistent copy however. Watching the memory usage on my machine, it seems that full_arr = full_arr[sort_idx] deletes the reference to the unsorted original, because after about 1 second all that is left is the memory used by the sorted array and the index, even if there is a transient copy.
A more compact usage of argsort() to save memory is this one:
full_arr = full_arr[full_arr[:,idx].argsort()]
This still causes a spike at the time of the assignment, where both a transient index array and a transient copy are made, but the memory is almost instantly freed again.
#ali_m pointed out a nice trick (credited to Joe Kington) for generating a de facto structured array with a view on full_arr. The benefit is that these may be sorted "in place", maintaining stable row order:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
Views work great for performing mathematical array operations, but for sorting it is far too inefficient for even a single sub-array from my dataset. In general, structured arrays just don't seem to scale very well even though they have really useful properties. If anyone has any idea why this is I would be interested to know.
One good option to minimize memory consumption and improve performance with very large arrays is to build a pipeline of small, simple functions. Functions clear local variables once they have completed so if intermediate data structures are building up and sapping memory this can be a good solution.
This a sketch of the pipeline I've used to speed up the massive array sort:
def process_sub_arrs(arr):
"""process a sub-array and return a 3-tuple of 1D values arrays"""
return values1, values2, values3
def build_arr():
"""build the initial array by joining processed sub-arrays"""
full_arr = np.zeros([868940742, 3])
i, j = 0, 0
for arr in list(arr1..arr22):
# indices (i, j) incremented at each loop based on sub-array size
j += len(arr)
full_arr[i:j, :] = np.column_stack( process_sub_arrs(arr) )
i = j
return full_arr
def sort_arr():
"""return full_arr and sort_idx"""
full_arr = build_arr()
sort_idx = np.argsort(full_arr[:, index])
return full_arr[sort_idx]
def get_sorted_arr():
"""call through nested functions to return the sorted array"""
sorted_arr = sort_arr()
<process sorted_arr>
return statistics
call stack: get_sorted_arr --> sort_arr --> build_arr --> process_sub_arrs
Once each inner function is completed get_sorted_arr() finally just holds the sorted array and then returns a small array of statistics.
EDIT: It is also worth pointing out here that even if you are able to use a more compact dtype to represent your huge array, you will want to use higher precision for summary calculations. For example, since full_arr.dtype = np.float16, the command np.mean(full_arr[:,idx]) tries to calculate the mean in half-precision floating point, but this quickly overflows when summing over a massive array. Using np.mean(full_arr[:,idx], dtype=np.float64) will prevent the overflow.
I posted this question initially because I was puzzled by the fact that a dataset of identical size suddenly began choking up my system memory, although there was a big difference in the proportion of unique values in the new "slow" set. #ali_m pointed out that, indeed, more uniform data with fewer unique values is easier to sort:
The qsort variant of Quicksort works by recursively selecting a
'pivot' element in the array, then reordering the array such that all
the elements less than the pivot value are placed before it, and all
of the elements greater than the pivot value are placed after it.
Values that are equal to the pivot are already sorted, so intuitively,
the fewer unique values there are in the array, the smaller the number
of swaps there are that need to be made.
On that note, the final change I ended up making to attempt to resolve this issue was to round the newer dataset in advance, since there was an unnecessarily high level of decimal precision leftover from an interpolation step. This ultimately had an even bigger effect than the other memory saving steps, showing that the sort algorithm itself was the limiting factor in this case.
Look forward to other comments or suggestions anyone might have on this topic, and I almost certainly misspoke about some technical issues so I would be glad to hear back :-)

How do I fill two (or more) numpy arrays from a single iterable of tuples?

The actual problem I have is that I want to store a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit in my 4Gb RAM, so I thought I could use two numpy.ndarrays.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how can I use it? The number of items in the iterable is unknown. I can't consume it to a list first due to memory limitations. I thought of itertools.tee, but it seems to add a lot of memory overhead here.
What I guess I could do is consume the iterator in chunks and add those to the arrays. Then my question is, how to do that efficiently? Should I maybe make 2 2D arrays and add rows to them? (Then later I'd need to convert them to 1D).
Or maybe there's a better approach? Everything I really need is to search through an array of strings by the value of the corresponding number in logarithmic time (that's why I want to sort by the value of float) and to keep it as compact as possible.
P.S. The iterable is not sorted.

Perhaps build a single, structured array using np.fromiter:
import numpy as np
def gendata():
# You, of course, have a different gendata...
for i in xrange(N):
yield (np.random.random(), str(i))
N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20')
Sorting it by the first column, using the second for tie-breakers will take O(N log N) time:
arr.sort(order=['f0','f1'])
Finding the row by the value in the first column can be done with searchsorted in O(log N) time:
# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')
idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')
You've asked many important questions in the comments; let me attempt to answer them here:
The basic dtypes are explained in the numpybook. There may be one or
two extra dtypes (like float16 which have been added since that
book was written, but the basics are all explained there.)
Perhaps a more thorough discussion is in the online documentation. Which is a good supplement to the examples you mentioned here.
Dtypes can be used to define structured arrays with column names, or
with default column names. 'f0', 'f1', etc. are default column
names. Since I defined the dtype as '<f8,|S20' I failed to provide
column names, so NumPy named the first column 'f0', and the second
'f1'. If we had used
dtype='[('fval','<f8'), ('text','|S20')]
then the structured array arr would have column names 'fval' and
'text'.
Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You
could conceivably iterate through gendata once to discover the
maximum length of the strings, build your dtype and then call
np.fromiter (and iterate through gendata a second time), but
that's rather burdensome. It is of course better if you know in
advance the maximum size of the strings. (|S20 defines the string
field as having a fixed length of 20 bytes.)
NumPy arrays place data of a
pre-defined size in arrays of a fixed size. Think of the array (even multidimensional ones) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it
would be hard for NumPy to find the right offsets. By hard, I mean
NumPy would need an index or somehow be redesigned. NumPy is simply not
built this way.
NumPy does have an object dtype which allows you to place a 4-byte
pointer to any Python object you desire. This way, you can have NumPy
arrays with arbitrary Python data. Unfortunately, the np.fromiter
function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
Note that np.fromiter has better performance when the count is
specified. By knowing the count (the number of rows) and the
dtype (and thus the size of each row) NumPy can pre-allocate
exactly enough memory for the resultant array. If you do not specify
the count, then NumPy will make a guess for the initial size of the
array, and if too small, it will try to resize the array. If the
original block of memory can be extended you are in luck. But if
NumPy has to allocate an entirely new hunk of memory then all the old
data will have to be copied to the new location, which will slow down
the performance significantly.

Here is a way to build N separate arrays out of a generator of N-tuples:
import numpy as np
import itertools as IT
def gendata():
# You, of course, have a different gendata...
N = 100
for i in xrange(N):
yield (np.random.random(), str(i))
def fromiter(iterable, dtype, chunksize=7):
chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
result = [chunk[name].copy() for name in chunk.dtype.names]
size = len(chunk)
while True:
chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
N = len(chunk)
if N == 0:
break
newsize = size + N
for arr, name in zip(result, chunk.dtype.names):
col = chunk[name]
arr.resize(newsize, refcheck=0)
arr[size:] = col
size = newsize
return result
x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]
# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')
idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize parameter with a larger value.

Big Satellite Image Processing

Im trying to run Mort Canty's http://mcanty.homepage.t-online.de/ Python iMAD implementation on bitemporal RapidEye Multispectral images. Which basically calculates the canonical correlation for the two images and then substracts them. The problem I'm having is
that the images are of 5000 x 5000 x 5 (bands) pixels. If I try to run this on
the whole image I get a memory error.
Would the use of something like pyTables help me with this?
What Mort Canty's code tries to do is that it loads the images using gdal and then stores them
in an 10 x 25,000,000 array.
# initial weights
wt = ones(cols*rows)
# data array (transposed so observations are columns)
dm = zeros((2*bands,cols*rows))
k = 0
for b in pos:
band1 = inDataset1.GetRasterBand(b+1)
band1 = band1.ReadAsArray(x0,y0,cols,rows).astype(float)
dm[k,:] = ravel(band1)
band2 = inDataset2.GetRasterBand(b+1)
band2 = band2.ReadAsArray(x0,y0,cols,rows).astype(float)
dm[bands+k,:] = ravel(band2)
k += 1
Even just creating a 10 x 25,000,000 numpy array of floats throws a memory error. Anyone have a good idea of how to get around this? This is my first post ever so any advice on how to post would also be welcome.
Greetings

numpy uses float64 per default, so your dm-array takes up 2GB of memory (8*10*25000000), the other arrays probably about 200MB (~8*5000*5000) each.
astype(float) returns a new array, so you need memory for that as well - and is probably not even needed as the type is implicitly converted when copying the data to the result array.
when the memory used in the for-loop is freed depends on garbage collection. and this doesn't consider the memory overhead of GetRasterBand, ReadAsArray.
are your sure your input data uses 64-bit floats? if it uses 32-bit floats, you could easyliy half the memory usage by specifying dtype='f' on your arrays.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Numpy integers high memory use - python

Related

memory usage: numpy-arrays vs python-lists

RAM usage in dealing with numpy arrays and Python lists

Memory efficient sort of massive numpy array in Python

How do I fill two (or more) numpy arrays from a single iterable of tuples?

Big Satellite Image Processing

Categories

Resources