memory usage: numpy-arrays vs python-lists - python

NumPy is known for optimized arrays and various advantages over Python lists.
But when I check the memory usage, the Python list appears to take less space than the NumPy array.
The code I used is below.
Can anyone explain to me why?
import sys
import numpy as np

Z = np.zeros((10, 10), dtype=int)
A = [[0] * 10] * 10
print(A, '\n', f'{sys.getsizeof(A)} bytes')
print(Z, '\n', f'{Z.size * Z.itemsize} bytes')

You're not measuring correctly; the native Python list only contains 10 references. You need to add in the collective size of the sub-lists as well:
>>> sys.getsizeof(A) + sum(map(sys.getsizeof, A))
1496
And it might get worse: each element inside the sub-lists could also be a reference (to an int). It's difficult to check whether the Python implementation is optimizing this away and storing the actual numbers inside the list.
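If you want a rough upper bound that also counts the element objects, you can sum getsizeof over every reference as well; here is a hedged sketch (all elements in this example are the same cached int 0, so shared objects are over-counted and the result is only an upper bound):
import sys

A = [[0] * 10] * 10

# Outer list + each sub-list + one getsizeof per element reference.
# Every element here is the same cached int 0, so this over-counts
# shared objects; treat the result as an upper bound.
total = (sys.getsizeof(A)
         + sum(sys.getsizeof(row) for row in A)
         + sum(sys.getsizeof(v) for row in A for v in row))
print(total, 'bytes')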
You're also under-representing the size of the numpy array, because it includes a header:
>>> Z.size * Z.itemsize
800
>>> sys.getsizeof(Z)
912
In either case it's not an exact science and will depend on your platform and Python implementation.

According to the spec (https://docs.python.org/3/library/sys.html#sys.getsizeof), "only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."
Also "getsizeof calls object’s " sizeof method
So you are given the size of just the container (a list object).
Please check https://code.activestate.com/recipes/577504/ for a complete size computation, which returns 296 bytes for your example, since only two unique objects are referenced: the single inner list [0, 0, ..., 0] (repeated ten times) and the int 0.
If you initialize the list with distinct values, the overall size will grow and become bigger than the np.array, which reserves 4 bytes per numpy.int32 element plus the size of its own internal data structure.
Find detailed info with examples here: https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
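As a rough, hedged illustration of that last point (exact numbers vary with platform and Python version), compare a list of 100 distinct integers with the equivalent 32-bit array:
import sys
import numpy as np

values = list(range(1000, 1100))          # 100 distinct int objects
list_total = sys.getsizeof(values) + sum(map(sys.getsizeof, values))
arr = np.arange(1000, 1100, dtype=np.int32)

print(list_total)          # container plus 100 separate int objects
print(sys.getsizeof(arr))  # ndarray header plus 100 * 4 bytes of raw data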

Related

How much memory is used by the underlying buffer of a broadcasted numpy array?

Note that nbytes doesn't provide the correct value. For instance:
>>> n = 1000
>>> x = np.arange(n)
>>> bx = np.broadcast_to(x, (int(1e15), n))
>>> bx.nbytes
8e18
...which is probably more RAM than exists on Earth.
EDIT: More specifically, is there a way to obtain the size of the buffer that bx refers to? Something along the lines of:
>>> x.nbytes
8000
>>> bx.underlying_buffer_size()
8000
Note that, as you can see in the docs, broadcast_to returns a view, so more than one element of the broadcasted array may refer to a single memory location. From the docs:
broadcast : array
A readonly view on the original array with the given shape. It is
typically not contiguous. Furthermore, more than one element of a
broadcasted array may refer to a single memory location.
Hence, in this case all new rows are pointing to the same memory location.
To see the actual size of the view object itself in bytes, you can use sys.getsizeof:
from sys import getsizeof
getsizeof(bx)
112
Note that indexing bx creates a new, small view object each time, so bx[0] and bx[1] have different identities even though they refer to the same underlying buffer:
id(bx[0])
# 1434315204368
id(bx[1])
# 1434315203968
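For the size of the buffer that bx actually refers to (what the question's hypothetical underlying_buffer_size() would return), one hedged approach is to follow the view's .base chain back to the array that owns the memory and read its nbytes; buffer_nbytes below is a hypothetical helper, assuming that chain consists of ndarrays:
import numpy as np

def buffer_nbytes(arr):
    # Follow .base until we reach the object that actually owns the memory.
    owner = arr
    while getattr(owner, 'base', None) is not None:
        owner = owner.base
    return owner.nbytes

n = 1000
x = np.arange(n)
bx = np.broadcast_to(x, (int(1e15), n))
print(buffer_nbytes(bx))  # 8000 -- the same buffer that x owns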

How does Python Numpy save memory compared to a list? [duplicate]

This question already has answers here:
What are the advantages of NumPy over regular Python lists?
(8 answers)
Closed 3 months ago.
I came across the following piece of code while studying Numpy:
import numpy as np
import time
import sys
S = range(1000)
print(sys.getsizeof(5) * len(S))
D = np.arange(1000)
print(D.size * D.itemsize)
The output of this is:
14000
4000
So NumPy saves memory. But I want to know: how does NumPy do it?
Source: https://www.edureka.co/blog/python-numpy-tutorial/
Edit: The linked question only answers half of my question; it doesn't say anything about what the NumPy module actually does.
In your example, D.size == len(S), so the difference is due to the difference between D.itemsize (8) and sys.getsizeof(5) (28).
D.dtype shows you that NumPy used int64 as the data type, which uses (unsurprisingly) 64 bits == 8 bytes per item. This is really only the raw numerical data, similar to a data type in C (under the hood it pretty much is exactly that).
In contrast, Python uses an int object for storing each item, which (as pointed out in the question linked to by FlyingTeller) is more than just the raw numerical data.
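A quick, hedged way to see that difference on your own machine (the exact numbers vary by platform and Python version):
import sys
import numpy as np

D = np.arange(1000)
print(D.dtype, D.itemsize)       # e.g. int64: 8 bytes of raw data per item
print(sys.getsizeof(int(D[0])))  # a full Python int object, typically 28 bytes
print(D.size * D.itemsize)       # raw data only
print(sys.getsizeof(D))          # raw data plus the ndarray header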
An ndarray stores its data in a contiguous data buffer.
For an example in my current ipython session:
In [63]: x.shape
Out[63]: (35, 7)
In [64]: x.dtype
Out[64]: dtype('int64')
In [65]: x.size
Out[65]: 245
In [66]: x.itemsize
Out[66]: 8
In [67]: x.nbytes
Out[67]: 1960
The array referenced by x has a small block of memory holding metadata such as shape and strides, plus this data buffer that takes up 1960 bytes.
Identifying the memory use of a list, e.g. xl = x.tolist(), is trickier. len(xl) is 35, that is, its data buffer holds 35 pointers. But each pointer references a different list of 7 elements, and each of those lists holds pointers to numbers. In my example the numbers are all small integers (below 256), so they come from CPython's small-integer cache and repeats point to the same object. For larger integers and floats there will be a separate Python object for each. So the memory footprint of a list depends on the degree of nesting as well as the type of the individual elements.
ndarray can also have object dtype, in which case it too contains pointers to objects elsewhere in memory.
And another nuance - the primary pointer buffer of a list is slightly oversized, to make append faster.
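You can see that over-allocation directly by watching getsizeof jump in steps as a list grows; a small sketch (the exact step sizes depend on the CPython version):
import sys

a = []
prev = sys.getsizeof(a)
print(0, prev)
for i in range(20):
    a.append(i)
    size = sys.getsizeof(a)
    if size != prev:          # size only changes when the buffer is reallocated
        print(len(a), size)
        prev = size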

Python Numpy integers high memory use

I'm trying to store 25 million integers efficiently in Python. Preferably using Numpy so I can perform operations on them afterwards.
The idea is to have them stored as 4-byte unsigned integers. So the required memory should be 25M entries * 4 bytes ≈ 95 MB.
I have written the following test code and the reported memory consumption is almost 700 MB, why?
import numpy as np

a = np.array([], dtype=np.uint32)
# TEST MEMORY FOR PUTTING 25 MILLION INTEGERS IN MEMORY
for i in range(0, 25000000):
    np.append(a, np.asarray([i], dtype=np.uint32))
If I do this for example, it works as expected:
a = np.random.random_integers(1, 25000000, size=25000000)
Why?
Actually the problem is range(0, 25000000), because on Python 2 this creates a list of int objects.
The memory needed to hold such a list is (assuming for simplicity 32 bytes per integer) 25,000,000 * 32 B = 800,000,000 B, roughly 762 MB.
Use the generator-like xrange, or move to Python 3, where range is much less memory-hungry (the values are not precomputed but produced on demand).
The actual NumPy array stays empty the whole time, because np.append returns a new array instead of modifying a in place; since the result is never stored, it is discarded immediately, so its contribution is negligible.
I would work with your a = np.random.random_integers(1, 25000000, size=25000000) and convert it (if you want) to np.uint afterwards.
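A couple of hedged alternatives that build the 25-million-element uint32 array directly, without a Python list of ints or repeated np.append copies:
import numpy as np

# If the values are simply 0..N-1, let NumPy generate them directly.
a = np.arange(25000000, dtype=np.uint32)

# If the values come from an arbitrary iterable, np.fromiter avoids
# materializing a Python list first.
b = np.fromiter(range(25000000), dtype=np.uint32, count=25000000)

print(a.nbytes)  # 100000000 bytes of raw data, i.e. about 95 MiB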

Python Memory Size

If I have a list of numbers and if I implement the same thing using dictionary, will they both occupy the same memory space?
Eg.
list = [[0, 1], [1, 0]]
dict = {'0,0': 0, '0,1': 1, '1,0': 1, '1,1': 0}
Will the memory size of the list and the dict be the same? Which will take up more space?
If you're using Python 2.6 or higher, you can use sys.getsizeof() to do a test
>>> import sys
>>> aList = [[0,1],[1,0]]
>>> aDict = {'0,0' : 0 , '0,1' : 1, '1,0' : 1, '1,1' : 0 }
>>> sys.getsizeof(aList)
44
>>> sys.getsizeof(aDict)
140
With the example you provided, we see that aDict takes up more space in memory.
There are very good chances that the dictionary will be bigger.
A dictionary and a list are fairly different data structures. When it boils down to memory and processor instructions, a list is fairly straightforward: the references it holds are contiguous in memory, so when you want to access item n, you go to the beginning of the list, move n slots forward, and return what is there. This is easy, because list items are contiguous and have integer keys.
On the other hand, the constraints for dictionaries are fairly different. You can't just go to the beginning of the dictionary and move forward by the key, because the key might not be numeric at all. Besides, keys don't need to be contiguous.
In the case of a dictionary, you need a structure to find the values associated to keys very easily, even though there might be no relationship between them. Therefore, it can't use the same kind of algorithms a list use. And typically, the data structures required by dictionaries are bigger than data structures required for lists.
Coincidentally, both may have the same size, even though it would be kind of surprising. Though, the data representation will be different, no matter what.
Getting the size of objects in python is tricky. Ostensibly, this can be done with sys.getsizeof but it returns incomplete data.
An empty list has a size of 32 bytes on my system.
>>> sys.getsizeof([])
32
A list with one element has a size of 36 bytes. This does not seem to vary according to the element.
>>> sys.getsizeof([[1, 2]])
36
>>> sys.getsizeof([1])
36
So you would need to know the size of the inner list as well.
>>> sys.getsizeof([1, 2])
40
So your memory usage for a list (assuming the same as my system) should be 32 bytes plus 44 bytes for every internal list. This is because Python stores 32 bytes of overhead for keeping a list, and every additional entry is a 4-byte pointer to the stored object. A two-element inner list therefore takes 40 bytes itself, plus the 4-byte pointer to it in the outer list: 44 bytes per inner list.
For a dict, it's 136 for an empty dict
>>> sys.getsizeof({})
136
From there, it will quadruple its size as adding members causes it to run out of space and risk frequent hash collisions. Again, you have to include the size of the object being stored and the keys as well.
>>> sys.getsizeof({1: 2})
136
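You can watch the dict resize as it grows with a small hedged sketch (the exact growth factor and resize thresholds depend on the Python version):
import sys

d = {}
prev = sys.getsizeof(d)
print(0, prev)
for i in range(1000):
    d[i] = i
    size = sys.getsizeof(d)
    if size != prev:          # size only changes when the hash table is rebuilt
        print(len(d), size)
        prev = size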
In short, they won't be the same. The list should perform better, unless you can have sparsely populated lists that could be implemented in a dict.
As mentioned by several others, getsizeof doesn't sum the contained objects.
Here is a recipe that does some work for you on standard python (3.0) types.
Compute Memory footprint of an object and its contents
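The linked recipe is not reproduced here; the sketch below is a simplified stand-in for its total_size function (it recurses over lists, tuples, sets and dicts and counts each object once), just so the calls further down have something concrete behind them:
import sys
from itertools import chain

def total_size(obj):
    """Approximate deep size in bytes; simplified stand-in for the recipe."""
    seen = set()                      # ids already counted, so shared objects count once

    def sizeof(o):
        if id(o) in seen:
            return 0
        seen.add(id(o))
        size = sys.getsizeof(o)
        if isinstance(o, dict):
            size += sum(map(sizeof, chain.from_iterable(o.items())))
        elif isinstance(o, (list, tuple, set, frozenset)):
            size += sum(map(sizeof, o))
        return size

    return sizeof(obj)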
Using this recipe on python 3.1 here are some results:
aList = [[x,x] for x in range(1000)]
aListMod10 = [[x%10,x%10] for x in range(1000)]
aTuple = [(x,x) for x in range(1000)]
aDictString = dict(("%s,%s" % (x,x),x) for x in range(1000))
aDictTuple = dict(((x,x),x) for x in range(1000))
print("0", total_size(0))
print("10", total_size(10))
print("100", total_size(100))
print("1000", total_size(1000))
print("[0,1]", total_size([0,1]))
print("(0,1)", total_size((0,1)))
print("aList", total_size(aList))
print("aTuple", total_size(aTuple))
print("aListMod10", total_size(aListMod10))
print("aDictString", total_size(aDictString))
print("aDictTuple", total_size(aDictTuple))
print("[0]'s", total_size([0 for x in range(1000)]))
print("[x%10]'s", total_size([x%10 for x in range(1000)]))
print("[x%100]'s", total_size([x%100 for x in range(1000)]))
print("[x]'s", total_size([x for x in range(1000)]))
Output:
0 12
10 14
100 14
1000 14
[0,1] 70
(0,1) 62
aList 62514
aTuple 54514
aListMod10 48654
aDictString 82274
aDictTuple 74714
[0]'s 4528
[x%10]'s 4654
[x%100]'s 5914
[x]'s 18514
It seems to follow that the most memory-efficient option would be to use two parallel lists:
list_x = [0, 1, ...]
list_y = [1, 0, ...]
This may only be worth it if you are running tight on memory and your list is expected to be large. I would guess that the usage pattern would be creating (x,y) tuples all over the place anyway, so it may be that you should really just do:
tuples = [(0, 1), (1, 0), ...]
All things being equal, choose what allows you to write the most readable code.

Memory efficient int-int dict in Python

I need a memory efficient int-int dict in Python that would support the following operations in O(log n) time:
d[k] = v # replace if present
v = d[k] # None or a negative number if not present
I need to hold ~250M pairs, so it really has to be tight.
Do you happen to know a suitable implementation (Python 2.7)?
EDIT Removed impossible requirement and other nonsense. Thanks, Craig and Kylotan!
To rephrase. Here's a trivial int-int dictionary with 1M pairs:
>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
... d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
...
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 25165960 51 25165960 51 dict (no owner)
1 1999521 100 23994252 49 49160212 100 int
On average, a pair of integers uses 49 bytes.
Here's an array of 2M integers:
>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
... a.append(random.randint(0, sys.maxint))
...
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 7 8000028 100 8000028 100 array.array
On average, a pair of integers uses 8 bytes.
I accept that 8 bytes/pair in a dictionary is rather hard to achieve in general. Rephrased question: is there a memory-efficient implementation of int-int dictionary that uses considerably less than 49 bytes/pair?
You could use the IIBTree from Zope.
I don't know if this is a one-shot solution or part of an ongoing project, but if it's the former, is throwing more RAM at it cheaper than the necessary developer time to optimize the memory usage? Even at 64 bytes per pair, you're still only looking at 15 GB, which would fit easily enough into most desktop boxes.
I think the correct answer probably lies within the SciPy/NumPy libraries, but I'm not familiar enough with the library to tell you exactly where to look.
http://docs.scipy.org/doc/numpy/reference/
You might also find some useful ideas in this thread:
Memory Efficient Alternatives to Python Dictionaries
8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you don't have a guarantee that the keys are contiguous then either you'd waste a lot of space between the keys by using an array representation (as well as needing some sort of dead value to indicate a null key), or you'd need to maintain a separate index to key/value pairs which by definition would exceed your 8 bytes per pair (even if only by a small amount).
I suggest you go with your array method, but the best approach will depend on the nature of the keys I expect.
How about a Judy array if you're mapping from ints? It is kind of a sparse array... Uses 1/4th of the dictionary implementation's space.
Judy:
$ cat j.py ; time python j.py
import judy, random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = judy.JudyIntObjectMap()
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
print h.heap()
Partition of a set of 4000004 objects. Total size = 96000624 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 4000001 100 96000024 100 96000024 100 int
1 1 0 448 0 96000472 100 types.FrameType
2 1 0 88 0 96000560 100 __builtin__.weakref
3 1 0 64 0 96000624 100 __builtin__.PyJudyIntObjectMap
real 1m9.231s
user 1m8.248s
sys 0m0.381s
Dictionary:
$ cat d.py ; time python d.py
import random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = {}
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
print h.heap()
Partition of a set of 8000003 objects. Total size = 393327344 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 201326872 51 201326872 51 dict (no owner)
1 8000001 100 192000024 49 393326896 100 int
2 1 0 448 0 393327344 100 types.FrameType
real 1m8.129s
user 1m6.947s
sys 0m0.559s
~1/4th the space:
$ echo 96000624 / 393327344 | bc -l
.24407309958089260125
(I'm using 64-bit Python, by the way, so my base numbers may be inflated due to 64-bit pointers.)
Looking at your data above, that's not 49 bytes per int, it's 25; the other 24 bytes per entry are the int objects themselves. So you need something significantly smaller than 25 bytes per entry, unless you are also going to reimplement the int objects, which is possible for the key hashes at least. Or implement it in C, where you can skip the objects completely (this is what Zope's IIBTree does, mentioned above).
To be honest the Python dictionary is highly tuned in various ways. It will not be easy to beat it, but good luck.
I have implemented my own int-int dictionary, available here (BSD license). In short, I use array.array('i') to store key-value pairs sorted by keys. In fact, instead of one large array, I keep a dictionary of smaller arrays (a key-value pair is stored in the (key // 65536)-th array) in order to speed up shifting during insertion and binary search during retrieval. Each array stores the keys and values in the following way:
key0 value0 key1 value1 key2 value2 ...
Actually, it is not only an int-int dictionary, but a general object-int dictionary with objects reduced to their hashes. Thus, the hash-int dictionary can be used as a cache of some persistently stored dictionary.
There are three possible strategies of handling "key collisions", that is, attempts to assign a different value to the same key. The default strategy allows it. The "deleting" removes the key and marks it as colliding, so that any further attempts to assign a value to it will have no effect. The "shouting" strategy throws an exception during any overwrite attempt and on any further access to any colliding key.
Please see my answer to a related question for a differently worded description of my approach.
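Not the linked implementation itself, but a minimal sketch of the core idea it describes: a flat array.array('i') of interleaved key/value pairs kept sorted by key, with binary search for lookups (the per-65536 bucketing and the collision strategies are omitted):
import array

class IntIntDict(object):
    """Sketch: sorted flat array of key0, value0, key1, value1, ..."""
    def __init__(self):
        self._data = array.array('i')

    def _find(self, key):
        # Binary search over the even (key) positions; returns the pair index
        # where key is, or where it would be inserted.
        lo, hi = 0, len(self._data) // 2
        while lo < hi:
            mid = (lo + hi) // 2
            if self._data[2 * mid] < key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def __setitem__(self, key, value):
        i = self._find(key)
        if 2 * i < len(self._data) and self._data[2 * i] == key:
            self._data[2 * i + 1] = value        # replace if present
        else:
            # Insert the new pair; this shifts the tail of the array.
            self._data[2 * i:2 * i] = array.array('i', [key, value])

    def __getitem__(self, key):
        i = self._find(key)
        if 2 * i < len(self._data) and self._data[2 * i] == key:
            return self._data[2 * i + 1]
        return -1                                 # negative number if not present

d = IntIntDict()
d[42] = 7
d[17] = 3
print(d[42])   # 7
print(d[99])   # -1 (not present)
Insertions shift the tail of the array, which is why the described implementation splits the keys across many smaller arrays keyed by key // 65536.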
