How much memory is used by a numpy ndarray? - python

Does anybody know how much memory is used by a numpy ndarray? (with let's say 10,000,000 float elements).

The array is simply stored in one contiguous block of memory. Assuming that by "float" you mean standard double-precision floating-point numbers, the array will need 8 bytes per element.
In general, you can simply query the nbytes attribute for the total memory requirement of an array, and itemsize for the size of a single element in bytes:
>>> import numpy
>>> a = numpy.arange(1000.0)
>>> a.nbytes
8000
>>> a.itemsize
8
In addition to the actual array data, there is also a small data structure containing the meta-information about the array. For large arrays in particular, the size of this structure is negligible.

To get the total memory footprint of the NumPy array in bytes, including the metadata, you can use Python's sys.getsizeof() function:
import sys
import numpy as np
a = np.arange(1000.0)
sys.getsizeof(a)
The result is 8104 bytes: 8000 bytes of element data plus about 104 bytes of metadata.
sys.getsizeof() works for any Python object. It reports the internal memory allocation, not necessarily the memory footprint of the object once it is written out to some file format, and it is sometimes wildly misleading: it does not follow references, so for example it misses the shared data buffer of a view into another array (the sketch below demonstrates this).
See the docs here. Ned Batchelder shares caveats with using sys.getsizeof() here.
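For instance, a minimal sketch of the view caveat: getsizeof counts an array's data buffer only when the array owns it, so a view into a large array reports just its small header:
import sys
import numpy as np
base = np.zeros((1000, 1000))  # ~8 MB of float64 data
view = base[::2, ::2]          # a strided view with no data of its own
print(sys.getsizeof(base))     # header plus the ~8 MB owned buffer
print(sys.getsizeof(view))     # only ~100-odd bytes: the header
print(view.base is base)       # True: the data lives in base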

I guess we can easily estimate it with print(a.size // 1024 // 1024, a.dtype). That is roughly the number of MB used, except you still have to multiply by the element size implied by the dtype (float64 = 8 bytes, int8 = 1 byte, and so on).
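A simpler route (a small sketch) is to let nbytes fold in the per-element size for you:
import numpy as np
a = np.arange(10000000, dtype=np.float64)
print(a.nbytes // 1024 // 1024, "MB")  # 76 MB: 10,000,000 elements * 8 bytes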

Related

Does NumPy array really take less memory than python list?

Please refer to the execution below:
import sys
import numpy as np
_list = [2,55,87]
print(f'1 - Memory used by Python List - {sys.getsizeof(_list)}')
narray = np.array([2,55,87])
size = narray.size * narray.itemsize
print(f'2 - Memory usage of np array using itemsize - {size}')
print(f'3 - Memory usage of np array using getsizeof - {sys.getsizeof(narray)}')
Here is the result I get:
1 - Memory used by Python List - 80
2 - Memory usage of np array using itemsize - 12
3 - Memory usage of np array using getsizeof - 116
One calculation suggests the numpy array consumes far less memory, but the other says it consumes more than a regular Python list. Shouldn't I be using getsizeof with a numpy array? What am I doing wrong here?
Edit - I just checked: an empty Python list consumes 56 bytes, whereas an empty np array consumes 104. Is this space used for pointers to the associated built-in methods and attributes?
The calculation using:
size = narray.size * narray.itemsize
does not include the memory consumed by non-element attributes of the array object. This can be verified by the documentation of ndarray.nbytes:
>>> x = np.zeros((3,5,2), dtype=np.complex128)
>>> x.nbytes
480
>>> np.prod(x.shape) * x.itemsize
480
In the above link, it can be read that ndarray.nbytes:
Does not include memory consumed by non-element attributes of the array object.
Note that from the code above you can conclude that your calculation excludes non-element attributes, since its value equals the one from ndarray.nbytes.
A list of the non-element attributes can be found in the section Array Attributes, including here for completeness:
ndarray.flags Information about the memory layout of the array.
ndarray.shape Tuple of array dimensions.
ndarray.strides Tuple of bytes to step in each dimension when traversing an array.
ndarray.ndim Number of array dimensions.
ndarray.data Python buffer object pointing to the start of the array’s data.
ndarray.size Number of elements in the array.
ndarray.itemsize Length of one array element in bytes.
ndarray.nbytes Total bytes consumed by the elements of the array.
ndarray.base Base object if memory is from some other object.
With regard to sys.getsizeof, it can be read in the documentation (emphasis mine) that:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
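To see that the non-element overhead is a small, roughly fixed amount, here is a sketch (the exact figure varies by NumPy version and platform):
import sys
import numpy as np
for n in (0, 10, 1000000):
    a = np.zeros(n)
    print(n, sys.getsizeof(a) - a.nbytes)  # roughly the same ~100-byte header each time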
Search on [numpy]getsizeof produces many potential duplicates.
The basic points are:
a list is a container, and the getsizeof docs warn us that it returns only the size of the container, not the size of the elements it references. So by itself it is an unreliable measure of the total size of a list (or tuple or dict).
getsizeof is a fairly good measure for arrays, if you take into account the roughly 100 bytes of "overhead". That overhead is a big part of a small array and a minor thing for a large one. nbytes is the simpler way of judging array memory use.
But for views, the data buffer is shared with the base, and doesn't count when using getsizeof.
object dtype arrays contain references just as lists do, so the same getsizeof caution applies (see the sketch below).
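A sketch of the object-dtype point: the array stores only pointers, so neither nbytes nor getsizeof reflects the referenced objects:
import sys
import numpy as np
obj = np.empty(3, dtype=object)
obj[:] = [list(range(1000)), "a string", None]
print(obj.nbytes)          # 24 on a 64-bit build: three 8-byte pointers
print(sys.getsizeof(obj))  # pointers plus header; the 1000-element list is not counted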
Overall I think understanding how arrays and lists are stored is a more useful way of judging their respective memory use. Focus more on computational efficiency than on memory use. For small stuff and iterative use, lists are better. Arrays are best when they are large and you use array methods to do the calculations.
Because numpy arrays store shape, strides, and other member variables that define the data layout, it is reasonable that they require some extra memory for this!
A list on the other hand has no specific type, or shape, etc.
However, if you start appending elements to a list instead of simply writing them into an array, and you go to larger numbers of elements, e.g. 1e7, you will see different behaviour!
Example case:
import numpy as np
import sys
N = int(1e7)
narray = np.zeros(N)
mylist = []
for i in range(N):
    mylist.append(narray[i])
print("size of np.array:", sys.getsizeof(narray))
print("size of list :", sys.getsizeof(mylist))
On my (ASUS) Ubuntu 20.04 PC I get:
size of np.array: 80000104
size of list : 81528048
Note that it is not only the memory footprint that matters for an application's efficiency! The data layout is sometimes far more important.

How can I make my Python program use 4 bytes for an int instead of 24 bytes?

To save memory, I want to use fewer bytes (4) for each int I have, instead of 24.
I looked at structs, but I don't really understand how to use them.
https://docs.python.org/3/library/struct.html
When I do the following:
myInt = struct.pack('I', anInt)
sys.getsizeof(myInt) doesn't return 4 like I expected.
Is there something that I am doing wrong? Is there another way for Python to save memory for each variable?
ADDED: I have 750,000,000 integers in an array that I wish to be able to access by index.
If you want to hold many integers in an array, use a numpy ndarray. NumPy is a very popular third-party package that handles arrays more compactly than Python alone does. It was considered for the standard library, but keeping it separate lets it be updated more frequently than Python itself. NumPy is one of the reasons Python has become so popular for data science and other scientific uses.
Numpy's np.int32 type uses four bytes for an integer. Declare your array full of zeros with
import numpy as np
myarray = np.zeros((750000000,), dtype=np.int32)
Or if you just want the array and do not want to spend any time initializing the values,
myarray = np.empty((750000000,), dtype=np.int32)
You then fill and use the array as you like. There is some Python overhead for the complete array, so the array's size will be slightly larger than 4 * 750000000, but the size will be close.
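As a quick sanity check, here is a sketch using a smaller array so it allocates instantly:
import numpy as np
small = np.zeros((1000,), dtype=np.int32)
print(small.itemsize)  # 4 bytes per element
print(small.nbytes)    # 4000 bytes of element data
# Scaled up, 750,000,000 int32 elements need 4 * 750000000
# = 3,000,000,000 bytes (~2.8 GiB), plus the small fixed header.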

Comparison between a list and numpy array

Well, I am a newbie in Python and I recently started with a numpy intro. Starting with the comparison between numpy arrays and lists, numpy is supposed to occupy less memory storage space. But after what I tried in the IDLE shell, I am confused. Here's what I have done:
import sys
import numpy as np
list1 = [1,2,3]
sys.getsizeof(list1)
48
a=np.array([1,2,3])
sys.getsizeof(a)
60
Why is the numpy array I created occupying more space than the list object?
First of all, getsizeof is not always the best way to compare the size of these two objects. From the docs:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
To answer your question however, what you're seeing here is simply the additional overhead of a numpy array, which will provide skewed results on such a small input sample.
If you want to know the size of just the data contained in a numpy array, there is an attribute you can check:
>>> a = np.array([1,2,3])
>>> a.nbytes
12
>>> a = np.array([1,2,3], dtype=np.int8)
>>> a.nbytes
3
This will not include the overhead:
Does not include memory consumed by non-element attributes of the array object.

List with sparse data consumes less memory than the same data as a numpy array

I am working with very high-dimensional vectors for machine learning and was thinking about using numpy to reduce the amount of memory used. I ran a quick test to see how much memory I could save using numpy (1)(3):
Standard list
import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]
Numpy array
import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)
Memory usage (2)
Numpy array: 1054 MB
Standard list: 2594 MB
Just like I expected.
By allocating one contiguous block of memory for native floats, numpy consumes only about half the memory the standard list uses.
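A back-of-the-envelope check of those numbers, as a sketch assuming a 32-bit build (4-byte pointers and roughly 16 bytes per Python float object):
n = 2 ** 27
print(n * 8 / 2 ** 20)         # numpy: 8-byte C doubles -> 1024 MB
print(n * (4 + 16) / 2 ** 20)  # list: pointer + float object per element -> 2560 MB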
Because I know my data is pretty sparse, I did the same test with sparse data.
Standard list
import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]
Numpy array
from numpy import fromiter
from random import random
random.seed(0)
vector = fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)
Memory usage (2)
Numpy array: 1054 MB
Standard list: 529 MB
Now, all of a sudden, the Python list uses half the amount of memory the numpy array uses! Why?
One thing I could think of is that Python dynamically switches to a dict representation when it detects that it contains very sparse data. Checking for this could add a lot of extra run-time overhead, so I don't really think that this is going on.
Notes
(1) I started a fresh Python shell for every test.
(2) Memory measured with htop.
(3) Run on 32-bit Debian.
A Python list is just an array of references (pointers) to Python objects. In CPython (the usual Python implementation) a list gets slightly over-allocated to make expansion more efficient, but it never gets converted to a dict. See the source code for further details: List object implementation
In the sparse version of the list, you have a lot of pointers to a single float 0.0 object (the 0.0 constant in the comprehension is shared in CPython). Those pointers take up 32 bits = 4 bytes each, but your numpy floats are 64 bits = 8 bytes each.
FWIW, to make the sparse list / array tests more accurate you should call random.seed(some_const) with the same seed in both versions so that you get the same number of zeroes in both the Python list and the numpy array.
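A sketch of why the sparse list is so small: it holds little more than pointers to that one shared object, 4 bytes each on a 32-bit build:
n = 2 ** 27
print(n * 4 / 2 ** 20)                    # ~512 MB of pointers, close to the observed 529 MB
zeros = [0.0 for i in range(10)]
print(all(z is zeros[0] for z in zeros))  # True: every entry is the same object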

Is there a way to get a view into a python array.array()?

I'm generating many largish 'random' files (~500MB) in which the contents are the output of repeated calls to random.randint(...). I'd like to preallocate a large buffer, write longs to that buffer, and periodically flush that buffer to disk. I am currently using array.array() but I can't see a way to create a view into this buffer. I need to do this so that I can feed the part of the buffer with valid data into hashlib.update(...) and to write the valid part of the buffer to the file. I could use the slice operator but AFAICT that creates a copy of the buffer, which isn't what I want.
Is there a way to do this that I'm not seeing?
Update:
I went with numpy as user42005 and hgomersall suggested. Unfortunately this didn't give me the speedups I was looking for. My dirt-simple C program generates ~700MB of data in 11s, while my python equivalent using numpy takes around 700s! It's hard to believe that that's the difference in performance between the two (I'm more likely to believe that I made a naive mistake somewhere...)
I guess you could use numpy: http://www.numpy.org - the fundamental array type in numpy at least supports no-copy views.
Numpy is incredibly flexible and powerful when it comes to views into arrays whilst minimising copies. For example:
import numpy
a = numpy.random.randint(0, 10, size=10)
b = a[3:10]
b is now a view of the original array that was created.
Numpy arrays allow all manner of access directly to the data buffers, and can be trivially typecast. For example:
a = numpy.random.randint(0, 10, size=10)
b = numpy.frombuffer(a.data, dtype='int8')
b is now a view into the memory with the data all as 8-bit integers (the data itself remains unchanged, so each 64-bit int now becomes 8 8-bit ints). These buffer objects (from a.data) are standard Python buffer objects, and so can be used in all the places that are defined to work with buffers.
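This connects back to the original goal: since a contiguous array (or a contiguous slice of one) exposes its buffer directly, the valid part can be fed to hashlib.update without copying. A sketch:
import hashlib
import numpy
a = numpy.random.randint(0, 10, size=10)
h = hashlib.sha256()
h.update(a[:5].data)  # hash only the first five elements; no copy is made
print(h.hexdigest())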
This direct buffer access works the same way for multi-dimensional arrays. However, you have to bear in mind how the data lies in memory. For example:
a = numpy.random.randint(0, 10, size=(10, 10))
b = numpy.frombuffer(a[3,:].data, dtype='int8')
will work, but
b = numpy.frombuffer(a[:,3].data, dtype='int8')
returns an error about being unable to get a single-segment buffer for a discontiguous array. This problem is not obvious, because simply assigning that same view to a variable using
b = a[:,3]
returns a perfectly adequate numpy array. However, it is not contiguous in memory, since it is a view into the other array, which need not be (and in this case is not) a view of contiguous memory. You can get info about the array using the flags attribute:
a[:,3].flags
which returns (among other things) both C_CONTIGUOUS (C order, row major) and F_CONTIGUOUS (Fortran order, column major) as False, but
a[3,:].flags
returns them both as True (for a 2D array with more than one row and column, at most one of them can be True).
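If you do need a contiguous buffer from such a column view, one workaround is numpy.ascontiguousarray; note this makes a copy, which defeats the zero-copy goal. A sketch:
import numpy
a = numpy.random.randint(0, 10, size=(10, 10))
col = numpy.ascontiguousarray(a[:, 3])  # contiguous copy of the column
b = numpy.frombuffer(col.data, dtype='int8')
print(b.size)  # 10 elements * col.itemsize bytes each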
