I am a newbie in Python and recently started with a NumPy intro. It begins with a comparison between NumPy arrays and lists, claiming that NumPy occupies less memory. But after what I tried in the IDLE shell, I am confused. Here's what I have done:
import sys
import numpy as np

list1 = [1, 2, 3]
sys.getsizeof(list1)
48
a = np.array([1, 2, 3])
sys.getsizeof(a)
60
Why is the NumPy array I created occupying more space than the list object?
First of all, getsizeof is not always the best way to compare the size of these two objects. From the docs:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
To answer your question however, what you're seeing here is simply the additional overhead of a numpy array, which will provide skewed results on such a small input sample.
If you want to know the size of just the data contained in a NumPy array, there is an attribute you can check:
>>> a = np.array([1,2,3])
>>> a.nbytes
12
>>> a = np.array([1,2,3], dtype=np.int8)
>>> a.nbytes
3
This will not include the overhead:
Does not include memory consumed by non-element attributes of the array object.
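To see how that fixed overhead plays out, here is a small sketch (the exact byte counts will vary with platform and NumPy version):
import sys
import numpy as np

small = np.array([1, 2, 3])
large = np.arange(1_000_000)

# For an array that owns its data, getsizeof = element data + a roughly
# constant per-array overhead (about 100 bytes here).
print(sys.getsizeof(small) - small.nbytes)   # overhead dominates a tiny array
print(sys.getsizeof(large) - large.nbytes)   # same overhead, now negligible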
Please refer to the execution below:
import sys
import numpy as np

_list = [2, 55, 87]
print(f'1 - Memory used by Python List - {sys.getsizeof(_list)}')

narray = np.array([2, 55, 87])
size = narray.size * narray.itemsize
print(f'2 - Memory usage of np array using itemsize - {size}')
print(f'3 - Memory usage of np array using getsizeof - {sys.getsizeof(narray)}')
Here is what I get as a result:
1 - Memory used by Python List - 80
2 - Memory usage of np array using itemsize - 12
3 - Memory usage of np array using getsizeof - 116
One way of calculating suggests the NumPy array consumes far less memory, but the other says it consumes more than a regular Python list? Shouldn't I be using getsizeof with a NumPy array? What am I doing wrong here?
Edit - I just checked: an empty Python list consumes 56 bytes whereas an empty np array consumes 104. Is this extra space being used for pointers to associated built-in methods and attributes?
The calculation using:
size = narray.size * narray.itemsize
does not include the memory consumed by non-element attributes of the array object. This can be verified by the documentation of ndarray.nbytes:
>>> x = np.zeros((3,5,2), dtype=np.complex128)
>>> x.nbytes
480
>>> np.prod(x.shape) * x.itemsize
480
In the above link, it can be read that ndarray.nbytes:
Does not include memory consumed by non-element attributes of the array object.
Note that from the code above you can conclude that your calculation excludes non-element attributes given that the value is equal to the one from ndarray.nbytes.
A list of the non-element attributes can be found in the section Array Attributes, including here for completeness:
ndarray.flags Information about the memory layout of the array.
ndarray.shape Tuple of array dimensions.
ndarray.strides Tuple of bytes to step in each dimension when traversing an array.
ndarray.ndim Number of array dimensions.
ndarray.data Python buffer object pointing to the start of the array’s data.
ndarray.size Number of elements in the array.
ndarray.itemsize Length of one array element in bytes.
ndarray.nbytes Total bytes consumed by the elements of the array.
ndarray.base Base object if memory is from some other object.
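For reference, inspecting a few of these attributes on the array from the nbytes example above (stride values assume the default C memory order):
import numpy as np

a = np.zeros((3, 5, 2), dtype=np.complex128)
print(a.shape, a.ndim)           # (3, 5, 2) 3
print(a.strides)                 # (160, 32, 16): bytes to step along each axis
print(a.size, a.itemsize)        # 30 16
print(a.nbytes)                  # 480 == size * itemsize, elements only
print(a.flags['C_CONTIGUOUS'])   # True for a freshly created array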
With regards to sys.getsizeof, it can be read in the documentation that:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
Search on [numpy]getsizeof produces many potential duplicates.
The basic points are:
a list is a container, and the getsizeof docs warn us that it returns only the size of the container, not the size of the elements that it references. So by itself it is an unreliable measure of the total size of a list (or tuple or dict).
getsizeof is a fairly good measure of arrays, if you take into account the roughly 100 bytes of "overhead". That overhead will be a big part of a small array, and a minor thing when looking at a large one. nbytes is the simpler way of judging array memory use.
But for views, the data-buffer is shared with the base, and doesn't count when using getsizeof (see the sketch after this list).
object dtype arrays contain references just like lists do, so the same getsizeof caution applies.
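To illustrate the point about views, a small sketch (exact object sizes are platform-dependent):
import sys
import numpy as np

base = np.zeros(1_000_000)   # owns roughly 8 MB of element data
view = base[::2]             # a view: shares the buffer, owns no data itself

print(base.nbytes, sys.getsizeof(base))   # both reflect the 8 MB buffer
print(view.nbytes, sys.getsizeof(view))   # nbytes still counts 4 MB of elements,
                                          # but getsizeof only sees the small
                                          # array object, since the buffer
                                          # belongs to base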
Overall I think understanding how arrays and lists are stored is a more useful way of judging their respective memory use. Focus more on computational efficiency than on memory use. For small stuff and iterative use, lists are better. Arrays are best when they are large and you use array methods to do the calculations.
Because NumPy arrays have shapes, strides, and other member variables that define the data layout, it is reasonable that they (might) require some extra memory for this!
A list on the other hand has no specific type, or shape, etc.
However, if you start appending elements to a list instead of simply writing them into an array, and you also go to larger numbers of elements, e.g. 1e7, you will see different behaviour!
Example case:
import numpy as np
import sys

N = int(1e7)
narray = np.zeros(N)
mylist = []
for i in range(N):
    mylist.append(narray[i])
print("size of np.array:", sys.getsizeof(narray))
print("size of list    :", sys.getsizeof(mylist))
On my (ASUS) Ubuntu 20.04 PC I get:
size of np.array: 80000104
size of list : 81528048
Note that it is not only the memory footprint that matters for an application's efficiency! The data layout is sometimes far more important.
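As a rough illustration of the layout point (timings will differ from machine to machine):
import timeit
import numpy as np

N = 10_000_000
narray = np.zeros(N)
mylist = narray.tolist()

# The contiguous, typed buffer lets NumPy sum at C speed, while the list
# forces Python-level iteration over boxed float objects.
print(timeit.timeit(lambda: narray.sum(), number=10))
print(timeit.timeit(lambda: sum(mylist), number=10))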
I encountered a strange behavior of np.ndarray.tobytes() that makes me doubt that it is working deterministically, at least for arrays of dtype=object.
import numpy as np
print(np.array([1,[2]]).dtype)
# => object
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00H{!-\x01\x00\x00\x00'
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00\x88\x9d)-\x01\x00\x00\x00'
In the sample code, a list of mixed python objects ([1, [2]]) is first converted to a numpy array, and then transformed to a byte sequence using tobytes().
Why do the resulting byte-representations differ for repeated instantiations of the same data? The documentation just states that it converts an ndarray to raw python bytes, but it does not refer to any limitations. So far, I observed this just for dtype=object. Numeric arrays always yield the same byte sequence:
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh#l\xee?Qg\x1e\x8f~l\xe7?'
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh#l\xee?Qg\x1e\x8f~l\xe7?'
Have I missed something elementary about Python's/NumPy's memory architecture? I tested with numpy version 1.17.2 on a Mac.
Context: I encountered this problem when trying to compute a hash for arbitrary data structures. I hoped that I can rely on the basic serialization capabilities of tobytes(), but this appears to be a wrong premise. I know that pickle is the standard for serialization in python, but since I don't require portability and my data structures only contain numbers, I first sought help with numpy.
An array of dtype object stores pointers to the objects it contains. In CPython, this corresponds to the id. Every time you create a new list, it will be allocated at a new memory address. However, small integers are interned, so 1 will reference the same integer object every time.
You can see exactly how this works by checking the IDs of some sample objects:
>>> x = np.array([1, [2]])
>>> x.tobytes()
b'\x90\x91\x04a\xfb\x7f\x00\x00\xc8[J\xaa+\x02\x00\x00'
>>> id(x[0])
140717641208208
>>> id(1) # Small integers are interned
140717641208208
>>> id(x[0]).to_bytes(8, 'little') # Checks out as the first 8 bytes
b'\x90\x91\x04a\xfb\x7f\x00\x00'
>>> id(x[1]).to_bytes(8, 'little') # Checks out as the last 8 bytes
b'\xc8[J\xaa+\x02\x00\x00'
As you can see, it is quite deterministic, but it serializes information that is essentially useless to you. The operation is the same for numeric arrays as for object arrays: it returns a view or copy of the underlying buffer. It is the contents of the buffer that are throwing you off.
Since you mentioned that you are computing hashes, keep in mind that there is a reason that python lists are unhashable. You can have lists that are equal at one time and different at another. Using IDs is generally not a good idea for an effective hash.
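If you only need to hash numeric arrays, one common workaround is to hash the raw element bytes together with the dtype and shape; here is a minimal sketch (array_digest is a hypothetical helper name):
import hashlib
import numpy as np

def array_digest(a):
    # Hash the element bytes plus dtype and shape, so arrays with the same
    # values but different types or shapes don't collide silently.
    # Only meaningful for numeric dtypes; an object array would hash pointers.
    h = hashlib.sha256()
    h.update(str(a.dtype).encode())
    h.update(str(a.shape).encode())
    h.update(np.ascontiguousarray(a).tobytes())
    return h.hexdigest()

print(array_digest(np.arange(5)))
print(array_digest(np.arange(5.0)))   # different dtype -> different digest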
I was using the NumPy np.empty() to get an array with a random value, but it doesn't work when I define a normal np.array() before.
Here is the two functions I used:
import numpy as np

def create_float_array(x):
    return np.array([float(x)])

def get_empty_array():
    return np.empty((), dtype=np.float).tolist()
Just to test the get_empty_array(), I wrote in the console:
>>> get_empty_array() # Should return a random float
0.007812501848093234
I was pleased with the result, so I tried this, but it didn't work the way I wanted:
>>> create_float_array(3.1415) # Create a NumPy array with the float given
array([3.1415])
>>> get_empty_array() # Should return another random value in a NumPy array
3.1415
I am not too sure why defining a NumPy array first stops np.empty() from giving a random value. Apparently, it gives the same value as the one in the np.array(), in this case 3.1415.
Note that I chose to leave the shape of the np.empty() to nothing for testing purposes, but in reality it would have some shape.
Finally, I know this is not the correct way of getting random values, but I need to use the np.empty() in my program, but don't exactly know why this behaviour occurs.
Just to clarify the point:
np.empty is not giving truly random values. The official NumPy documentation states that it will contain "uninitialized entries" or "arbitrary data":
numpy.empty(shape, dtype=float, order='C')
Return a new array of given shape and type, without initializing entries.
[...]
Returns:
out : ndarray
Array of uninitialized (arbitrary) data of the given shape, dtype, and order. Object arrays will be initialized to None.
So what does uninitialized or arbitrary mean? To understand that you have to understand that when you create an object (any object) you need to ask someone (that someone can be the NumPy internals, the Python internals or your OS) for the required amount of memory.
So when you create an empty array NumPy asks for memory. The amount of memory for a NumPy array will be some overhead for the Python object and a certain amount of memory to contain the values of the array. That memory may contain anything. So an "uninitialized value" means that it simply contains whatever is in that memory you got.
What happened here is just a coincidence. You created an array containing one float, then you printed it, then it was destroyed again because nothing kept a reference to it (although that is CPython-specific; other Python implementations may not free the memory immediately, they just free it eventually). Then you created an empty array containing one float. The amount of memory for the second array is identical to the amount of memory just released by the first array. Here's where the coincidence comes in: something (NumPy, Python or your OS) decided to give you the same memory location again.
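A small sketch of that coincidence (whether the values actually repeat depends entirely on the allocator, so this may print anything):
import numpy as np

kept = np.array([3.1415])      # keep a reference, so this buffer is NOT freed
print(np.empty(()).tolist())   # very unlikely to be 3.1415: the old buffer is still in use

np.array([2.718])              # create and immediately discard an array (no reference kept)
print(np.empty(()).tolist())   # may or may not print 2.718; whether the freed
                               # buffer is handed back is an allocator detail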
Does anybody know how much memory is used by a numpy ndarray? (with let's say 10,000,000 float elements).
The array is simply stored in one consecutive block in memory. Assuming by "float" you mean standard double precision floating point numbers, then the array will need 8 bytes per element.
In general, you can simply query the nbytes attribute for the total memory requirement of an array, and itemsize for the size of a single element in bytes:
>>> a = numpy.arange(1000.0)
>>> a.nbytes
8000
>>> a.itemsize
8
In addition to the actual array data, there will also be a small data structure containing the meta-information on the array. Especially for large arrays, the size of this data structure is negligible.
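For the 10,000,000-element case from the question, assuming the default double precision:
import numpy as np

a = np.zeros(10_000_000)   # float64 by default: 8 bytes per element
print(a.itemsize)          # 8
print(a.nbytes)            # 80000000 bytes of element data, roughly 76 MiB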
To get the total memory footprint of the NumPy array in bytes, including the metadata, you can use Python's sys.getsizeof() function:
import sys
import numpy as np
a = np.arange(1000.0)
sys.getsizeof(a)
The result is 8104 bytes.
sys.getsizeof() works for any Python object. It reports the internal memory allocation, not necessarily the memory footprint of the object once it is written out to some file format. Sometimes it is wildly misleading; for example, with 2D arrays it doesn't dig into the memory footprint of the underlying vectors.
See the docs here. Ned Batchelder shares caveats with using sys.getsizeof() here.
I guess we can easily compute it with print(a.size // 1024 // 1024, a.dtype).
That tells you roughly how many MB are used, once the dtype is taken into account: float64 is 8 bytes per element, int8 is 1 byte, and so on.
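A slightly more explicit version of that calculation, folding the item size in directly:
import numpy as np

a = np.ones(10_000_000, dtype=np.float64)
print(a.size * a.itemsize // 1024 // 1024, "MiB")   # element count * bytes per element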
I'm generating many largish 'random' files (~500MB) in which the contents are the output of repeated calls to random.randint(...). I'd like to preallocate a large buffer, write longs to that buffer, and periodically flush that buffer to disk. I am currently using array.array() but I can't see a way to create a view into this buffer. I need to do this so that I can feed the part of the buffer with valid data into hashlib.update(...) and to write the valid part of the buffer to the file. I could use the slice operator but AFAICT that creates a copy of the buffer, which isn't what I want.
Is there a way to do this that I'm not seeing?
Update:
I went with numpy as user42005 and hgomersall suggested. Unfortunately this didn't give me the speedups I was looking for. My dirt-simple C program generates ~700MB of data in 11s, while my python equivalent using numpy takes around 700s! It's hard to believe that that's the difference in performance between the two (I'm more likely to believe that I made a naive mistake somewhere...)
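As an aside, a gap of that size often comes from calling random.randint once per value in a Python loop; one hedged sketch of generating a whole block with a single vectorized call (N and the bounds are placeholders):
import numpy as np

N = 10_000_000   # placeholder block size: however many longs you buffer at once

# One vectorized call fills the whole block at C speed, instead of one
# Python-level random.randint call per value.
block = np.random.randint(0, np.iinfo(np.int64).max, size=N, dtype=np.int64)

# block.tobytes() (or the block.data buffer) can then be fed to
# hashlib's update() and written to the file.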
I guess you could use numpy: http://www.numpy.org - the fundamental array type in numpy at least supports no-copy views.
Numpy is incredibly flexible and powerful when it comes to views into arrays whilst minimising copies. For example:
import numpy
a = numpy.random.randint(0, 10, size=10)
b = a[3:10]
b is now a view of the original array that was created.
Numpy arrays allow all manner of access directly to the data buffers, and can be trivially typecast. For example:
a = numpy.random.randint(0, 10, size=10)
b = numpy.frombuffer(a.data, dtype='int8')
b is now a view into the memory with the data all as 8-bit integers (the data itself remains unchanged, so each 64-bit int now becomes eight 8-bit ints). These buffer objects (from a.data) are standard Python buffer objects and so can be used in all the places that are defined to work with buffers.
The same is true for multi-dimensional arrays. However, you have to bear in mind how the data lies in memory. For example:
a = numpy.random.randint(0, 10, size=(10, 10))
b = numpy.frombuffer(a[3,:].data, dtype='int8')
will work, but
b = numpy.frombuffer(a[:,3].data, dtype='int8')
raises an error about being unable to get a single-segment buffer for a discontiguous array. This problem is not obvious, because simply assigning that same view to a variable using
b = a[:,3]
returns a perfectly adequate numpy array. However, it is not contiguous in memory as it's a view into the other array, which need not be (and in this case isn't) a view of contiguous memory. You can get info about the array using the flags attribute on an array:
a[:,3].flags
which returns (among other things) both C_CONTIGUOUS (C order, row major) and F_CONTIGUOUS (Fortran order, column major) as False, but
a[3,:].flags
returns them both as True (in 2D arrays, at most one of them can be true).
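To check the individual flags directly, for example:
import numpy

a = numpy.random.randint(0, 10, size=(10, 10))
print(a[3, :].flags['C_CONTIGUOUS'], a[3, :].flags['F_CONTIGUOUS'])   # True True
print(a[:, 3].flags['C_CONTIGUOUS'], a[:, 3].flags['F_CONTIGUOUS'])   # False False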