How is memory handled in Python's lists?

See the code below. When a = [1, 2], a homogeneous list of ints, the addresses of the 1st and 2nd elements differ by 32 bytes.
But in the second case, when a = [1, 'a', 3], there is no relation between the addresses of the 1st and 2nd elements, although the addresses of the 1st and 3rd elements still differ by 64 bytes.
So I want to know how memory is handled here, how indexing takes place, and how this is linked to lists being non-hashable (that is, mutable).
>>> a=[1,2]
>>> print(id(a[0]))
4318513456
>>> print(id(a[1]))
4318513488
>>> a=[1,'a',3]
>>> print(id(a[0]))
4318513456
>>> print(id(a[1]))
4319642992
>>> print(id(a[2]))
4318513520
>>>

In general, ids don't matter. Don't worry about ids. Don't look at ids. In fact, it's a CPython implementation detail that ids are memory addresses, because it's convenient. Another Python implementation might do something else.
In any case, you're seeing CPython's small integer cache (see e.g. this question), where certain integer objects are preallocated as "singleton" objects since they're expected to be used often. However, that too is an implementation detail.
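For example (a quick sketch: the cached range, typically -5 through 256, is itself a CPython implementation detail and may vary):
n = 1000
a = n + 24      # 1024 is computed at run time and lies outside the cache
b = n + 24
print(a is b)   # usually False on CPython: two distinct int objects
m = 250
c = m + 6       # 256 falls inside the cache
d = m + 6
print(c is d)   # True on CPython: both names reference the cached object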
The string 'a', on the other hand, is not cached (it might have been interned if your code had been loaded from a .py file on disk), so it's allocated from somewhere else.
As for your question about indexing: CPython lists (again, another implementation might do things differently) are, under the hood, arrays of pointers to PyObjects, so indexing is just an O(1) operation.
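As a rough illustration, the indexing cost doesn't grow with the list length, which is what you'd expect from a pointer-array layout (timings will of course vary by machine):
import timeit

small = list(range(10))
big = list(range(1000000))

# Indexing is a single pointer lookup in the underlying array, so the
# cost is essentially the same regardless of list length or position.
print(timeit.timeit(lambda: small[5], number=1000000))
print(timeit.timeit(lambda: big[999999], number=1000000))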

Related

Why is memory space allocation different for the same objects?

I was experimenting with how Python allocates memory and found the same issue as in
Size of list in memory, which Eli describes in a much better way. His answer led me to a new doubt: I checked the combined size of 1 and [], and compared it to the size of [1], but they are different, as you can see in the code snippet. If I'm not wrong, the memory allocation should be the same, but it's not the case. Can anyone help me understand this?
>>> import sys
>>> sys.getsizeof(1)
28
>>> sys.getsizeof([])
64
>>> 28 + 64
92
>>> sys.getsizeof([1])
72
What's the minimum information a list needs to function?
- some kind of top-level list object, containing a reference to the class information (methods, type info, etc.), and the list's own instance data
- the actual objects stored in the list
... that gets you the size you expected. But is it enough?
A fixed-size list object can only track a fixed number of list entries: traditionally just one (head) or two (head and tail).
Adding more entries to the list doesn't change the size of the list object itself, so there must be some extra information: the relationship between list elements.
It's possible to store this information in every Object (this is called an intrusive list), but it's very restrictive: each Object can only be stored in one list at a time.
Since Python lists clearly don't behave like that, we know this extra information isn't already in the list element, and it can't be inside the list object, so it must be stored elsewhere. Which increases the total size of the list.
NB. I've kept this argument fairly abstract deliberately. You could implement list a few different ways, but none of them avoid some extra storage for element relationships, even if the representation differs.
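For example, on a 64-bit CPython build the extra storage shows up as one 8-byte pointer per element, and the element objects themselves are shared rather than copied into the list (exact numbers vary by version):
import sys

empty = []
one = [1]
# One extra element costs one pointer in the list's internal array;
# the ~28-byte int object itself lives outside the list.
print(sys.getsizeof(one) - sys.getsizeof(empty))  # typically 8

# The same object can sit in many lists at once, which is why the
# relationship data cannot live inside the element itself.
x = [1, 2, 3]
y = [x[0], x[1]]
print(x[0] is y[0])  # True: both lists point at the same int object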

How does np.ndarray.tobytes() work for dtype "object"?

I encountered a strange behavior of np.ndarray.tobytes() that makes me doubt that it is working deterministically, at least for arrays of dtype=object.
import numpy as np
print(np.array([1,[2]]).dtype)
# => object
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00H{!-\x01\x00\x00\x00'
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00\x88\x9d)-\x01\x00\x00\x00'
In the sample code, a list of mixed python objects ([1, [2]]) is first converted to a numpy array, and then transformed to a byte sequence using tobytes().
Why do the resulting byte-representations differ for repeated instantiations of the same data? The documentation just states that it converts an ndarray to raw python bytes, but it does not refer to any limitations. So far, I observed this just for dtype=object. Numeric arrays always yield the same byte sequence:
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh#l\xee?Qg\x1e\x8f~l\xe7?'
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh#l\xee?Qg\x1e\x8f~l\xe7?'
Have I missed something elementary about Python's/numpy's memory architecture? I tested with numpy version 1.17.2 on a Mac.
Context: I encountered this problem when trying to compute a hash for arbitrary data structures. I hoped I could rely on the basic serialization capabilities of tobytes(), but this appears to be a wrong premise. I know that pickle is the standard for serialization in Python, but since I don't require portability and my data structures only contain numbers, I first sought help with numpy.
An array of dtype object stores pointers to the objects it contains. In CPython, this corresponds to the id. Every time you create a new list, it will be allocated at a new memory address. However, small integers are interned, so 1 will reference the same integer object every time.
You can see exactly how this works by checking the IDs of some sample objects:
>>> x = np.array([1, [2]])
>>> x.tobytes()
b'\x90\x91\x04a\xfb\x7f\x00\x00\xc8[J\xaa+\x02\x00\x00'
>>> id(x[0])
140717641208208
>>> id(1) # Small integers are interned
140717641208208
>>> id(x[0]).to_bytes(8, 'little') # Checks out as the first 8 bytes
b'\x90\x91\x04a\xfb\x7f\x00\x00'
>>> id(x[1]).to_bytes(8, 'little') # Checks out as the last 8 bytes
b'\xc8[J\xaa+\x02\x00\x00'
As you can see, it is quite deterministic, but it serializes information that is essentially useless to you. The operation is the same for numeric arrays as for object arrays: it returns a view or copy of the underlying buffer. The contents of the buffer are what is throwing you off.
Since you mentioned that you are computing hashes, keep in mind that there is a reason that python lists are unhashable. You can have lists that are equal at one time and different at another. Using IDs is generally not a good idea for an effective hash.
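If your data really is purely numeric, one possible sketch is to hash the raw buffer together with the dtype and shape, which is deterministic for numeric dtypes (array_digest below is just an illustrative helper, not a NumPy API):
import hashlib
import numpy as np

def array_digest(arr):
    # Only meaningful for numeric dtypes; for dtype=object the buffer
    # holds memory addresses, which change between runs.
    h = hashlib.sha256()
    h.update(str(arr.dtype).encode())
    h.update(str(arr.shape).encode())
    h.update(np.ascontiguousarray(arr).tobytes())
    return h.hexdigest()

a = np.arange(5, dtype=np.int64)
b = np.arange(5, dtype=np.int64)
print(array_digest(a) == array_digest(b))  # True: same dtype, shape and data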

Python list: I don't know why the difference is 16

>>> import sys
>>> print(sys.getsizeof(int()))
12
>>> print(sys.getsizeof(str()))
25
>>> mylist = [1,2,3,4,5,'ab']
>>> print(id(mylist))
50204144
>>> print(id(mylist[0]))
1849873456
>>> print(id(mylist[1]))
1849873472
>>> print(id(mylist[2]))
1849873488
>>> print(id(mylist[3]))
1849873504
>>> print(id(mylist[4]))
1849873520
>>> print(id(mylist[5]))
50209152
I don't know why the difference is 16 (I'm on a 64-bit operating system).
Because they are small ints: in CPython the cached small integer objects happen to be laid out 16 bytes apart in memory, so consecutive ids differ by 16. I truly recommend you see this post: What is the id() function used for?
It looks like your question is: if sys.getsizeof(int()) is 12, then why are some of the id() values 16 bytes apart instead of 12?
And it looks like you are expecting newly allocated ints to be 12 bytes away from one another because an int takes 12 bytes of storage.
If you are expecting that, it is because you are expecting a Python list to be like a C array, a chunk of contiguous memory where an array of five 8-byte objects takes exactly 40 bytes. But Python lists are not arrays, and list elements are not necessarily allocated in ascending memory order (let alone packed together). And so you can't expect the values of id() to be predictable from the amount of memory the object takes up.
By all means learn about the way Python's data structures are really allocated, if that really interests you. But it is a topic so advanced that few outside the CPython core team ever need to think about it. The rest of us are just content that it works. Which is why you are getting comments like "it's an implementation detail" and "why do you care?".
It is important to know how a C array is allocated because in C you manipulate memory pointers directly and getting that wrong can be catastrophic. But Python takes care of memory allocation for you, so knowing all the details of how it works is unlikely to make you a better Python programmer.
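As a quick sketch (this relies on CPython's small-integer cache, itself an implementation detail), two separate lists can point at the very same element objects, so the element ids tell you nothing about where either list lives:
xs = [1, 2, 3]
ys = [1, 2, 3]

# The two lists are distinct objects...
print(xs is ys)                          # False
# ...but on CPython they reference the same cached int objects,
# so the elements are clearly not stored inside either list.
print([a is b for a, b in zip(xs, ys)])  # [True, True, True]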

getsizeof returns the same value for seemingly different lists

I have the following two dimensional bitmap:
num = 521
arr = [i == '1' for i in bin(num)[2:].zfill(n*n)]
board = [arr[n*i:n*i+n] for i in xrange(n)]
Just out of curiosity I wanted to check how much more space it would take if it held integers instead of booleans. So I checked the current size with sys.getsizeof(board) and got 104.
After that I modified it to
arr = [int(i) for i in bin(num)[2:].zfill(n*n)], but I still got 104.
Then I decided to see how much I would get with just strings:
arr = [i for i in bin(num)[2:].zfill(n*n)], which still shows 104
This looks strange, because I expected a list of lists of strings to use far more memory than one of booleans.
Apparently I am missing something about how getsizeof calculates the size. Can anyone explain why I get these results?
P.S. Thanks to zehnpard's answer, I see that I can use sum(sys.getsizeof(i) for line in board for i in line) to approximately count the memory (it most probably won't count the lists themselves, which is not that important for me). Now I do see a difference in the numbers for strings vs. int/bool (and no difference between int and bool).
The docs for the sys module since Python 3.4 are pretty explicit:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
Given that Python lists are effectively arrays of pointers to other Python objects, the number of elements a Python list contains will influence its size in memory (more pointers) but the type of objects contained will not (memory-wise, they aren't contained in the list, just pointed at).
To get the size of all items in a container, you need a recursive solution, and the docs helpfully provide a link to an activestate recipe.
http://code.activestate.com/recipes/577504/
Given that this recipe is for Python 2.x, I'm sure this behavior was always standard and has simply been mentioned explicitly in the docs from 3.4 onwards.
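As a rough illustration in the spirit of that recipe (total_size below is a simplified sketch, not the recipe itself), a recursive sizeof makes the bool/str difference visible where sys.getsizeof alone does not:
import sys

def total_size(obj, seen=None):
    # Simplified recursive sizeof: handles only a few container types
    # and skips objects it has already counted.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

bools = [[True, False], [False, True]]
strs = [['1', '0'], ['0', '1']]
print(sys.getsizeof(bools), sys.getsizeof(strs))  # equal: only the outer list is measured
print(total_size(bools), total_size(strs))        # differ: the elements are counted too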

Are sets internally sorted, or is the __str__ method displaying a sorted list?

I have a set, I add items (ints) to it, and when I print it, the items apparently are sorted:
a = set()
a.add(3)
a.add(2)
a.add(4)
a.add(1)
a.add(5)
print a
# set([1, 2, 3, 4, 5])
I have tried with various values; apparently it only happens with ints.
I run Python 2.7.5 under MacOSX. It is also reproduced using repl.it (see http://repl.it/TpV)
The question is: is this documented somewhere (I haven't found it so far)? Is it normal? Is it something that can be relied on?
Extra question: when is the sort done? During the print? Is the set internally stored sorted? (Is that even possible, given the expected constant complexity of insertion?)
This is a coincidence. The data is neither sorted nor does __str__ sort.
The hash values for integers equal their value (except for -1 and long integers outside the sys.maxint range), which increases the chance that integers are slotted in order, but that's not a given.
A set uses a hash table to track the items it contains, and the ordering depends on the hash values as well as on the insertion and deletion history.
The how and why of the interaction between integers and sets are all implementation details, and can easily vary from version to version. Python 3.3 introduced hash randomisation for certain types, and Python 3.4 expanded on this, making ordering of sets and dictionaries volatile between Python process restarts too (depending on the types of values stored).
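As a quick check (the exact values are CPython-specific and may change between versions), compare the hash values and then pick integers that land in out-of-order hash-table slots:
print(hash(1), hash(5), hash(-1))  # 1 5 -2: small ints mostly hash to themselves

s = {9, 16}
# With the default 8-slot table, 16 lands in slot 0 and 9 in slot 1,
# so on CPython this typically prints {16, 9}, not numeric order.
print(s)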
