Memory Efficiency of NumPy - python

While learning NumPy, I came across one of its claimed advantages:
NumPy requires less memory than a traditional list.
import numpy as np
import sys
# Less Memory
l = range(1000)
print(sys.getsizeof(l[3])*len(l))
p = np.arange(1000)
print(p.itemsize*p.size)
This looks convincing, but then when I try,
print(sys.getsizeof(p[3])*len(p))
It shows a higher memory size than the list.
Can someone help me understand this behavior?

First of all, as mentioned in the comments, getsizeof() is not a good function to rely on for this purpose, because it does not have to hold true for third-party extensions, as it is implementation specific. Also, as mentioned in the documentation, if you want to find the size of containers and all their contents, there is a recipe available at: https://code.activestate.com/recipes/577504/.
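A minimal sketch of such a recursive size calculation might look like this (the recipe linked above is more thorough and handles more container types):
import sys

def deep_getsizeof(obj, seen=None):
    # rough recursive size of an object and everything it references
    if seen is None:
        seen = set()                      # avoid counting shared objects twice
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size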
Now, regarding the Numpy arrays, it's very important to know how Numpy determines its arrays' types. For that purpose, you can read: How does numpy determine the array's dtype and what does it mean?
To sum up, the most important reason that Numpy performs better in memory management is that it provides a wide variety of types that you can use for different kinds of data. You can read about Numpy's datatypes here: https://docs.scipy.org/doc/numpy-1.14.0/user/basics.types.html. Another reason is that Numpy is a library designed to work with matrices and arrays, and for that reason there are many under-the-hood optimizations in how their items consume memory.
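For instance, picking a smaller dtype directly shrinks the data buffer (a quick sketch; the exact default integer size depends on your platform):
import numpy as np

a64 = np.arange(1000)                    # default integer dtype, usually 8 bytes per element
a16 = np.arange(1000, dtype=np.int16)    # 2 bytes per element is enough for values 0..999
print(a64.nbytes)   # 8000 on a typical 64-bit build
print(a16.nbytes)   # 2000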
Also, it's noteworthy that Python provides an array module designed to perform efficiently by using constrained item types.
Arrays are sequence types and behave very much like lists, except that the type of objects stored in them is constrained. The type is specified at object creation time by using a type code, which is a single character.
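A quick sketch of the array module in action (byte counts are approximate and CPython-specific):
import sys
from array import array

arr = array('i', range(1000))   # 'i' items are stored compactly as C ints (typically 4 bytes each)
lst = list(range(1000))
print(sys.getsizeof(arr))       # one buffer holding the raw values themselves
print(sys.getsizeof(lst))       # only the pointer buffer; the 1000 int objects live elsewhere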

It's easier to understand the memory use of arrays:
In [100]: p = np.arange(10)
In [101]: sys.getsizeof(p)
Out[101]: 176
In [102]: p.itemsize*p.size
Out[102]: 80
The data buffer of p is 80 bytes long. The rest of p is object overhead: attributes like shape, strides, etc.
An indexed element of the array is a numpy object.
In [103]: q = p[0]
In [104]: type(q)
Out[104]: numpy.int64
In [105]: q.itemsize*q.size
Out[105]: 8
In [106]: sys.getsizeof(q)
Out[106]: 32
So this multiplication doesn't tell us anything useful:
In [109]: sys.getsizeof(p[3])*len(p)
Out[109]: 320
Though it may help us estimate the size of this list:
In [110]: [i for i in p]
Out[110]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [111]: type(_[0])
Out[111]: numpy.int64
In [112]: sys.getsizeof(__)
Out[112]: 192
The list of 10 int64 objects occupies 320+192 bytes, more or less (the list overhead and its pointer buffer, plus the sizes of the objects pointed to).
We can extract an int object from the array with item:
In [115]: p[0].item()
Out[115]: 0
In [116]: type(_)
Out[116]: int
In [117]: sys.getsizeof(p[0].item())
Out[117]: 24
Lists of the same length can have differing sizes, depending on how much growth space they have:
In [118]: sys.getsizeof(p.tolist())
Out[118]: 144
Further complicating things is the fact that small integers are stored differently from large ones - values from -5 to 256 are cached by the interpreter, so repeats all point to the same object.
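You can see that caching directly (a small sketch; the exact cutoffs are a CPython implementation detail):
i = int("100")
j = int("100")
print(i is j)    # True in CPython: ints from -5 to 256 are cached and shared
x = int("1000000")
y = int("1000000")
print(x is y)    # usually False: each large int is a separate object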

Related

Difference between list and NumPy array memory size

I've heard that Numpy arrays are more efficient than Python's built-in lists and that they take less space in memory. As I understand it, Numpy stores these objects next to each other in memory, while the Python implementation of a list stores 8-byte pointers to the values. However, when I try to test this in a Jupyter notebook, it turns out that both objects have the same size.
import numpy as np
from sys import getsizeof
array = np.array([_ for _ in range(4)])
getsizeof(array), array
Returns (128, array([0, 1, 2, 3]))
Same as:
l = list([_ for _ in range(4)])
getsizeof(l), l
Gives (128, [0, 1, 2, 3])
Can you provide a clear example of how I can show this in a Jupyter notebook?
getsizeof is not a good measure of memory use, especially with lists. As you note, the list has a buffer of pointers to objects elsewhere in memory. getsizeof reports the size of that buffer, but tells us nothing about the objects.
With
In [66]: list(range(4))
Out[66]: [0, 1, 2, 3]
the list has its basic object storage, plus the buffer with 4 pointers (plus some growth room). The numbers are stored elsewhere. In this case the numbers are small, and already created and cached by the interpreter, so their storage doesn't add anything. But larger numbers (and floats) are created with each use, and take up space. Also, a list can contain anything, such as pointers to other lists, or strings or dicts, or whatever.
In [67]: arr = np.array([i for i in range(4)]) # via list
In [68]: arr
Out[68]: array([0, 1, 2, 3])
In [69]: np.array(range(4)) # more direct
Out[69]: array([0, 1, 2, 3])
In [70]: np.arange(4)  # faster
Out[70]: array([0, 1, 2, 3])
arr too has basic object storage with attributes like shape and dtype. It too has a data buffer, but for a numeric dtype like this, that buffer holds actual numeric values (8-byte integers), not pointers to Python integer objects.
In [71]: arr.nbytes
Out[71]: 32
That data buffer only takes 32 bytes - 4*8.
For this small example it's not surprising that getsizeof returns the same thing. The basic object storage is more significant than where the 4 values are stored. It's when working with thousands of values, and with multidimensional arrays, that memory use becomes significantly different.
But more important are the calculation speeds. With an array you can do things like arr+1 or arr.sum(). These operate in compiled code, and are quite fast. Similar list operations have to iterate, at slow Python speeds, through the pointers, fetching values, etc. But doing the same sort of element-by-element iteration on arrays is even slower.
As a general rule, if you start with lists, and do list operations such as append and list comprehensions, it's best to stick with them.
But if you can create the arrays once, or from other arrays, and then use numpy methods, you'll get 10x speed improvements. Arrays are indeed faster, but only if you use them in the right way. They aren't a simple drop-in substitute for lists.
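As a rough sketch of that rule of thumb (timings will vary by machine), building with a list and converting once beats growing an array element by element, because every np.append allocates a new array and copies everything:
import time
import numpy as np

n = 10_000

# build with a plain list, convert to an array once at the end
t0 = time.time()
tmp = []
for i in range(n):
    tmp.append(i * 2)
arr = np.array(tmp)
print('list append + np.array:', time.time() - t0, 'seconds')

# grow an array element by element: every np.append copies the whole array
t0 = time.time()
arr = np.array([], dtype=int)
for i in range(n):
    arr = np.append(arr, i * 2)
print('np.append in a loop:', time.time() - t0, 'seconds')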
A NumPy array keeps its general array information in the array object header (shape, data type, etc.), and all the values are stored in one contiguous block of memory. Lists, by contrast, allocate a new memory block for every new object and store pointers to them. So when you iterate over a list you are not iterating directly over memory, you are iterating over pointers, which is not handy when you are working with large data. Here is an example:
import sys
import numpy as np

random_values_numpy = np.arange(1000)
random_values = list(range(1000))
# NumPy: bytes per element, then the size of the whole data buffer
print(random_values_numpy.itemsize)
print(random_values_numpy.size * random_values_numpy.itemsize)
# Python list: size of one int object, then a rough total
# (the pointer buffer plus the int objects it points to)
print(sys.getsizeof(random_values[0]))
print(sys.getsizeof(random_values) + sys.getsizeof(random_values[0]) * len(random_values))

how does memory allocation occur in numpy array?

import numpy as np
a = np.arange(5)
for i in a:
    print("Id of {} : {} \n".format(i, id(i)))
>>>>
Id of 0 : 2295176255984
Id of 1 : 2295176255696
Id of 2 : 2295176255984
Id of 3 : 2295176255696
Id of 4 : 2295176255984
I want to understand how the elements of a numpy array are allocated in memory, which, judging by the output, is different from how Python lists store their elements.
Any help is appreciated.
In [68]: arr = np.arange(5)
In [69]: arr
Out[69]: array([0, 1, 2, 3, 4])
One way of viewing the attributes of a numpy array is:
In [70]: arr.__array_interface__
Out[70]:
{'data': (139628245945184, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (5,),
'version': 3}
data is something like the id of its data buffer, where the values are actually stored. We can't use this number in other code, but it is useful when checking for things like views. The rest is used to interpret those values.
The memory for arr is a C array 40 bytes long (5*8) somewhere; exactly where does not matter to us. Any view of arr will work with the same data buffer; a copy will have its own data buffer.
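A small sketch of that view/copy distinction (np.shares_memory checks whether two arrays overlap in memory):
import numpy as np

arr = np.arange(5)
v = arr[1:4]            # a view: same data buffer
c = arr[1:4].copy()     # a copy: its own data buffer

print(np.shares_memory(arr, v))   # True
print(np.shares_memory(arr, c))   # False

arr[2] = 99
print(v)   # the view sees the change
print(c)   # the copy does not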
Iterating on the array is like accessing values one by one:
In [71]: i = arr[1]
In [72]: i
Out[72]: 1
In [73]: type(i)
Out[73]: numpy.int64
This i is not a reference to an element of arr. It is a new object with the same value. It's a lot like a 0d array, with many of the same attributes, including:
In [74]: i.__array_interface__
Out[74]:
{'data': (25251568, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (),
'version': 3,
'__ref': array(1)}
This is why you can't make much sense of the ids seen during iteration. It is also why iterating on a numpy array is slower than iterating on a list. Iteration like this is strongly discouraged.
Contrast that with a list, where elements are stored (in some sort of data buffer) by reference:
In [78]: a,b,c = 100,'b',{}
In [79]: id(a)
Out[79]: 9788064
In [80]: alist=[a,b,c]
In [81]: id(alist[0])
Out[81]: 9788064
The list actually contains a, or, if you prefer, a reference to the same object that the variable a references. Remember, Python is object oriented all the way down.
In sum, Python lists contain references. Numpy arrays contain values, which its own methods access and manipulate. There is an object dtype that does contain references, but let's not go there.
I'm a fan of Code with Mosh. He teaches this kind of thing on his YouTube channel as well as on Udemy. I've purchased his Udemy course on data structures and algorithms, which goes deep into how things work.
For example, while teaching about arrays, he shows how to build an array from scratch so as to understand the underlying concepts behind it.
You can take a look here: https://www.youtube.com/watch?v=BBpAmxU_NQo
If you're only interested in the NumPy array:
First I'll tell you about the differences:
Difference between a NumPy array and a Python list
Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
The Python core library provides lists. A list is the Python equivalent of an array, but it is resizeable and can contain elements of different types.
A common beginner question is what is the real difference here. The answer is performance. Numpy data structures perform better in:
Size - Numpy data structures take up less space
Performance - they have a need for speed and are faster than lists
Functionality - SciPy and NumPy have optimized functions such as linear algebra operations built-in.
Another notable difference is in how they store and make use of memory.
Memory
The main benefits of using NumPy arrays should be smaller memory consumption and better runtime behaviour.
For Python lists: every new element needs another eight bytes for the reference to the new object, and the new integer object itself consumes 28 bytes.
NumPy takes up less space: an integer array of length n needs only a small fixed overhead plus 8 bytes per element, because the values sit directly in the array's data buffer instead of being separate Python objects.
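A rough way to see the memory difference in code (a sketch; byte counts are approximate and CPython-specific):
import sys
import numpy as np

n = 1000
lst = list(range(1000, 1000 + n))   # values above 256, so the small-int cache doesn't hide the cost
arr = np.arange(1000, 1000 + n)

list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)   # pointer buffer + int objects
array_bytes = arr.nbytes                                               # the data buffer: 8 bytes per element

print(list_bytes)    # roughly (8 + 28) bytes per element, plus the list object overhead
print(array_bytes)   # 8000; the array object itself adds only ~100 bytes on top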
If you are curious and want me to prove that NumPy really takes less time:
# importing required packages
import numpy
import time
# size of arrays and lists
size = 1000000
# declaring lists
list1 = range(size)
list2 = range(size)
# declaring arrays
array1 = numpy.arange(size)
array2 = numpy.arange(size)
# capturing time before the multiplication of Python lists
initialTime = time.time()
# multiplying elements of both lists and storing the result in another list
resultantList = [(a * b) for a, b in zip(list1, list2)]
# calculating execution time
print("Time taken by Lists to perform multiplication:",
(time.time() - initialTime),
"seconds")
# capturing time before the multiplication of Numpy arrays
initialTime = time.time()
# multiplying elements of both Numpy arrays and storing the result in another Numpy array
resultantArray = array1 * array2
# calculating execution time
print("Time taken by NumPy Arrays to perform multiplication:",
(time.time() - initialTime),
"seconds")
Output:
Time taken by Lists to perform multiplication: 0.15030384063720703 seconds
Time taken by NumPy Arrays to perform multiplication: 0.005921125411987305 seconds
Wait... there is a very big disadvantage too:
NumPy arrays require contiguous allocation of memory -
insertion and deletion operations can become costly, because the data is stored in contiguous memory locations and inserting or deleting an element means shifting everything after it.
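For example (a quick sketch; note that np.insert returns a brand new array rather than modifying the old one in place):
import numpy as np

lst = list(range(10))
lst.insert(3, 99)              # shifts the later pointers within the list's own buffer
print(lst)

arr = np.arange(10)
arr = np.insert(arr, 3, 99)    # allocates a new array and copies every element
print(arr)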
If you want to learn more about numpy:
https://www.educba.com/introduction-to-numpy/
You can thank me later!

Numpy view contiguous part of non-contiguous array as dtype of bigger size

I was trying to generate an array of trigrams (i.e. consecutive three-letter combinations) from a super long char array:
# data is actually load from a source file
a = np.random.randint(0, 256, 2**28, 'B').view('c')
Since making a copy is not efficient (and it creates problems like cache misses), I generated the trigrams directly using stride tricks:
tri = np.lib.stride_tricks.as_strided(a, (len(a) - 2, 3), a.strides * 2)
This generates a trigram list with shape (2**28 - 2, 3) where each row is a trigram. Now I want to convert the trigrams to strings (i.e. dtype S3) so that numpy displays them more "reasonably" (instead of as individual chars).
tri = tri.view('S3')
It gives the exception:
ValueError: To change to a dtype of a different size, the array must be C-contiguous
I understand that generally data should be contiguous in order to create a meaningful view, but this data is contiguous where it matters: each group of three elements is contiguous.
So I'm wondering how to view a contiguous part of a non-contiguous np.ndarray as a dtype of a bigger size. A more "standard" way would be better, while hackish ways are also welcome. It seems that I can set shape and strides freely with np.lib.stride_tricks.as_strided, but I can't force the dtype to be something, which is the problem here.
EDIT
A non-contiguous array can be made by simple slicing. For example:
np.empty((8, 4), 'uint32')[:, :2].view('uint64')
will throw the same exception as above (even though, from a memory point of view, I should be able to do this). This case is much more common than my example above.
If you have access to a contiguous array from which your non-contiguous one is derived, it should typically be possible to work around this limitation.
For example your trigrams can be obtained like so:
>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b')', b'\xf2', b'\xf7', ..., b'\xf4', b'\xf1', b'z'], dtype='|S1')
>>> np.lib.stride_tricks.as_strided(a[:0].view('S3'), ((2**28)-2,), (1,))
array([b')\xf2\xf7', b'\xf2\xf7\x14', b'\xf7\x14\x1b', ...,
b'\xc9\x14\xf4', b'\x14\xf4\xf1', b'\xf4\xf1z'], dtype='|S3')
In fact, this example demonstrates that all we need is a contiguous "stub" at the memory buffer's base for view casting; afterwards, because as_strided does not do many checks, we are essentially free to do whatever we like.
It seems we can always get such a stub by slicing to a size 0 array. For your second example:
>>> X = np.empty((8, 4), 'uint32')[:, :2]
>>> np.lib.stride_tricks.as_strided(X[:0].view(np.uint64), (8, 1), X.strides)
array([[140133325248280],
[ 32],
[ 32083728],
[ 31978800],
[ 0],
[ 29686448],
[ 32],
[ 32362720]], dtype=uint64)
As of numpy 1.23.0, you will be able to do exactly what you want without jumping through extra hoops. I've added PR#20722 to numpy to address pretty much this exact issue. The idea is that if your new dtype is smaller than the current, you can clearly expand a unit or contiguous axis without any problems. If the new dtype is larger, you can shrink a contiguous axis.
With the update, your code runs out of the box:
>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b'\x19', b'\xf9', b'\r', ..., b'\xc3', b'\xa3', b'{'], dtype='|S1')
>>> tri = np.lib.stride_tricks.as_strided(a, (len(a)-2,3), a.strides*2)
>>> tri.view('S3')
array([[b'\x9dB\xeb'],
[b'B\xebU'],
[b'\xebU\xa4'],
...,
[b'-\xcbM'],
[b'\xcbM\x97'],
[b'M\x97o']], dtype='|S3')
The array has to have a unit dimension or be contiguous in the last axis, which is true in your case.
I've also added PR#20694 to introduce slicing to the np.char module. If that PR gets accepted as-is, you will be able to do:
>>> np.char.slice_(a.view(f'U{len(a)}'), step=1, chunksize=3)

How does Python Numpy save memory compared to a list? [duplicate]

This question already has answers here:
What are the advantages of NumPy over regular Python lists?
(8 answers)
Closed 3 months ago.
I came across the following piece of code while studying Numpy:
import numpy as np
import time
import sys
S= range(1000)
print(sys.getsizeof(5)*len(S))
D= np.arange(1000)
print(D.size*D.itemsize)
The output of this is:
O/P - 14000
4000
So Numpy saves memory storage. But I want to know how Numpy does it.
Source: https://www.edureka.co/blog/python-numpy-tutorial/
Edit: That question only answers half of my question; it doesn't mention anything about what the Numpy module does.
In your example, D.size == len(S), so the difference is due to the difference between D.itemsize (8) and sys.getsizeof(5) (28).
D.dtype shows you that NumPy used int64 as the data type, which uses (unsurprisingly) 64 bits == 8 bytes per item. This is really only the raw numerical data, similar to a data type in C (under the hood it pretty much is exactly that).
In contrast, Python uses an int for storing the items, which (as pointed out in the question linked to by FlyingTeller) is more than just the raw numerical data.
An ndarray stores its data in a contiguous data buffer.
For example, in my current ipython session:
In [63]: x.shape
Out[63]: (35, 7)
In [64]: x.dtype
Out[64]: dtype('int64')
In [65]: x.size
Out[65]: 245
In [66]: x.itemsize
Out[66]: 8
In [67]: x.nbytes
Out[67]: 1960
The array referenced by x has a block of memory with info like shape and strides, and this data buffer that takes up 1960 bytes.
Identifying the memory use of a list, e.g. xl = x.tolist(), is trickier. len(xl) is 35, that is, its data buffer has 35 pointers. But each pointer references a different list of 7 elements. Each of those lists has pointers to numbers. In my example the numbers are all integers less than 256, so they are cached by the interpreter (repeats point to the same object). For larger integers and floats there will be a separate Python object for each. So the memory footprint of a list depends on the degree of nesting as well as the type of the individual elements.
ndarray can also have object dtype, in which case it too contains pointers to objects elsewhere in memory.
And another nuance - the primary pointer buffer of a list is slightly oversized, to make append faster.
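You can see that over-allocation by watching getsizeof as a list grows (a quick sketch; exact numbers vary by CPython version):
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(20):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        print(len(lst), size)   # the size jumps in steps, not on every append
        last = size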

numpy matrix subset view

I want to view a numpy matrix by specifying row and column numbers. For example, rows 0 and 2 and columns 0 and 2 of a 3×3 matrix.
M = np.array(range(9)).reshape((3,3))
M[:,[0,2]][[0,2],:]
But I know this is not a view; a new matrix is created because of the chained fancy indexing. Is it possible to get such a view?
I think it is strange that I can do
M[:2,:2]
to view the matrix, but cannot use
M[[0,1],[0,1]]
to achieve the same view.
EDIT: provide one more example. If I have a matrix
M = np.array(range(16)).reshape((4,4))
How do I get rows [1,2,3] and columns [0,2,3] with a single step of indexing? This will do it in 2 steps:
M[[1,2,3],:][:,[0,2,3]]
How do I get rows [1,2,3] and columns [0,2,3] with a single step of indexing?
You could use np.ix_ instead, but this is neither less typing nor is it faster. In fact it's slower:
%timeit M[np.ix_([1,2,3],[0,2,3])]
100000 loops, best of 3: 17.8 µs per loop
%timeit M[[1,2,3],:][:, [0,2,3]]
100000 loops, best of 3: 10.9 µs per loop
How to force a view (if possible)?
You can use numpy.lib.stride_tricks.as_strided to ask for a tailored view of an array.
Here is an example of its use from scipy-lectures.
This would allow you to get a view instead of a copy in your very first example:
from numpy.lib.stride_tricks import as_strided
M = np.array(range(9)).reshape((3,3))
sub_1 = M[:,[0,2]][[0,2],:]
sub_2 = as_strided(M, shape=(2, 2), strides=(48,16))
print(sub_1)
print()
print(sub_2)
[[0 2]
[6 8]]
[[0 2]
[6 8]]
# change the initial array
M[0,0] = -1
print(sub_1)
print()
print(sub_2)
[[0 2]
[6 8]]
[[-1 2]
[ 6 8]]
As you can see sub_2 is indeed a view since it reflects changes made to the initial array M.
The strides argument passed to as_strided specifies the byte-sizes to "walk" in each dimension:
The datatype of the initial array M is numpy.int64 (on my machine), so an int is 8 bytes in memory. Since Numpy arranges arrays by default in C-style (row-major) order, one row of M is consecutive in memory and takes 24 bytes. Since you want every other row, you specify 48 bytes as the stride in the first dimension. For the second dimension you also want every other element -- elements within a row sit next to each other in memory -- so you specify 16 bytes as the stride.
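Rather than hard-coding 48 and 16, you can derive the strides from M itself, which keeps the trick portable across dtypes (a sketch producing the same view; M[::2, ::2] is the simpler equivalent here):
import numpy as np
from numpy.lib.stride_tricks import as_strided

M = np.array(range(9)).reshape((3, 3))
# every other row and every other column: double both strides
sub = as_strided(M, shape=(2, 2), strides=(M.strides[0] * 2, M.strides[1] * 2))
print(sub)
# [[0 2]
#  [6 8]]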
For your latter example, Numpy is not able to return a view because the requested indices are too irregular to be described through shape and strides.
For your second example:
import numpy as np
M = np.array(range(16)).reshape((4,4))
print(M[np.meshgrid([1,2,3],[0,2,3])].transpose())
The .transpose() is necessary because of meshgrid's order of indexing. According to the Numpy docs there is a new indexing option, so that M[np.meshgrid([1,2,3],[0,2,3],indexing='ij')] should work, but I don't have Numpy's latest version and can't test it.
M[[0,1],[0,1]] returns elements at (0,0) and (1,1) in the matrix.
Slicing a numpy array gives a view of the array, but your code M[:2, :2] gets the submatrix with rows 0, 1 and columns 0, 1 of M; to take every other row and column you need ::2:
In [1710]: M
Out[1710]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
In [1711]: M[:2, :2]
Out[1711]:
array([[0, 1],
[3, 4]])
In [1712]: M[::2, ::2]
Out[1712]:
array([[0, 2],
[6, 8]])
To understand this behavior of numpy, you need to read up on numpy array striding. The great power of numpy lies in providing a uniform interface for the whole numpy/scipy ecosystem to grow around. That interface is the ndarray, which provides a simple yet general method for storing numerical data.
'Simple' and 'general' are value judgements of course, but a balance has been struck by settling on strided arrays to form this interface. Every numpy array has a set of strides that tells you how to find any given element in the array, as a simple inner product between strides and indices.
Of course one could imagine an alternative numpy which had different code paths for all kinds of other data representations; much in the same way as one could imagine the pyramids of Giza, except ten times bigger. Easy to imagine; but building it is a little more work.
What is, however, impossible is indexing an array as arr[[2,0,1]] and representing the result as a strided view on the same piece of memory. arr[[1,0]], on the other hand, could be represented as a view, but returning a view or a copy depending on the content of the indices you are indexing with would mean a performance hit for what should be a simple operation, and it would make for rather confusing semantics as well.
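You can check the view/copy distinction directly (a small sketch using np.shares_memory):
import numpy as np

arr = np.arange(6)
print(np.shares_memory(arr, arr[::2]))        # True: basic slicing is a strided view
print(np.shares_memory(arr, arr[[2, 0, 1]]))  # False: fancy indexing has to copy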
