Does Python store similar objects at memory locations near each other?
I ask because the ids of similar objects, say lists and tuples, are closer to each other than to the id of an object of type str.
No, except of course by coincidence. While this is highly implementation- and environment-specific, and there are memory management schemes that dedicate page-sized memory regions to objects of the same type, no Python implementation I'm aware of exhibits the behavior you describe. The possible exception is small integers, which are sometimes cached under the hood and will likely end up located next to each other.
What you're seeing may be because string literals are created at import time (part of the constants in the byte code) and interned, while lists and tuples (that don't contain literals) are created while running code. If a bunch of memory is allocated in between (especially if it isn't freed), the state of the heap may be sufficiently different that quite different addresses are handed out when you're checking.
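You can check this for yourself, with the caveat that the actual id() values are implementation-, platform-, and run-specific, so treat the output as illustrative only:
a = "hello"                     # string literal: part of the module's constants, interned
b = "world"
xs = [1, 2, 3]                  # built at run time on the heap
ys = tuple(x * 2 for x in xs)   # also built at run time

for name, obj in [("a", a), ("b", b), ("xs", xs), ("ys", ys)]:
    print(name, hex(id(obj)))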
Related
How can I see the size (i.e. memory usage) of all objects currently in memory?
Running the following does not provide correct figures for memory usage
import sys
[(m, sys.getsizeof(m)) for m in dir() if not m.startswith('__')]
# Returns
# ('ElasticNet', 59),
# ('ElasticNetCV', 61),
# ('In', 51),
# ...
Whereas for example sys.getsizeof(ElasticNet) shows me that ElasticNet has a size of 1064.
Additionally, is there a convenient tool for assessing which objects are taking up large amounts of RAM, so as to delete and garbage collect them during the script or session?
NB: Total memory used by Python process points out how to profile memory at the level of the Python process, whereas I want to determine memory usage for each object separately, and moreover do so conveniently by retrieving the size of the object that m (inside the list comprehension) refers to.
sys.getsizeof is not recursive; it only gives you the "intrinsic" size of the object you pass in. For an instance of a regular pure-Python class it will pretty much always return 48 (on recent Python versions), and with __slots__ it will return 32 + n_slots * 8. That's because a regular instance is really an empty shell holding a __dict__ pointer and a __weakref__ pointer, whereas with __slots__ all the fields are stored "inline" (not through a dict) and the __weakref__ pointer is only included if you add it. If you want to know the "full" size of an object you have to recurse into it, but the insight you get from that is complicated by, e.g., shared ownership: if a 1 KB list is referenced by two different objects, do you count it twice? If not, to which object's tally does it contribute?
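A quick way to see the difference (the exact numbers depend on the CPython version, so treat them as illustrative):
import sys

class Plain:
    pass

class Slotted:
    __slots__ = ("a", "b")

p, s = Plain(), Slotted()
p.a, p.b = 1, 2
s.a, s.b = 1, 2

print(sys.getsizeof(p))            # just the "shell"; the attributes live in p.__dict__
print(sys.getsizeof(p.__dict__))   # the per-instance dict is a separate object, not counted above
print(sys.getsizeof(s))            # slots are stored inline, no __dict__ at all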
Python classes are pretty large objects, so each class is about 1k (there's some variation depending on the class' specifics).
If you want to have more visibility into object sizes, the tools to do so are in the gc module, which lets you iterate over essentially everything in various ways. That is very low level and not exactly convenient, though; a memory profiler of some sort (pympler, guppy, memory_profiler) will usually offer higher-level tools, although that still doesn't make their usage trivial.
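For example, a crude, non-recursive sketch along those lines, which sums the intrinsic sizes of everything the collector currently tracks and groups the totals by type:
import gc
import sys
from collections import Counter

sizes = Counter()
for obj in gc.get_objects():
    try:
        sizes[type(obj).__name__] += sys.getsizeof(obj)
    except TypeError:
        pass  # a few exotic objects may not support getsizeof

for type_name, total in sizes.most_common(10):
    print(type_name, total)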
I have a large structure of primitive types within nested dict/list. The structure is quite complicated and doesn't really matter.
If I represent it with Python's built-in types (dict/list/float/int/str) it takes 1.1 GB, but if I store it in protobuf and load it into memory it is significantly smaller, ~250 MB total.
I'm wondering how this can be. Are the built-in types in Python inefficient in comparison to some external library?
Edit: The structure is loaded from a JSON file, so there are no internal references between objects.
"Simple" python objects, such as int or float, need much more memory than their C-counterparts used by protobuf.
Let's take a list of Python integers as an example, compared to an array of integers as stored in an array.array (i.e. array.array('i', ...)).
The analysis for array.array is simple: discarding some overhead from the array.array object itself, only 4 bytes (the size of a C integer) are needed per element.
The situation is completely different for a list of integers:
the list does not hold the integer objects themselves but pointers to them (8 additional bytes per element on a 64-bit build)
even a small non-zero integer needs at least 28 bytes (see import sys; sys.getsizeof(1) returns 28): 8 bytes for reference counting, 8 bytes for a pointer to the integer's function table, 8 bytes for the size of the integer value (Python's integers can be much bigger than 2^32), and at least 4 bytes to hold the integer value itself
there is also a memory-management overhead of about 4.5 bytes per object
This means a whopping cost of 40.5 bytes per Python integer, compared to the 4 bytes possible (or 8 bytes if we used long long int, i.e. 64-bit integers).
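A hedged sanity check of these figures on 64-bit CPython (exact values vary slightly between versions):
import sys
from array import array

n = 1_000_000
ints_as_list = list(range(n))
ints_as_array = array('i', range(n))

# Container sizes only; the list additionally keeps n separate int objects alive.
print(sys.getsizeof(ints_as_list))    # n pointers (8 bytes each) plus list overhead
print(sys.getsizeof(ints_as_array))   # n * 4 bytes plus a small array header
print(sys.getsizeof(1))               # 28: intrinsic size of a small int object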
The situation is similar for a list of Python floats compared to an array of doubles (i.e. array.array('d', ...)), which needs only about 8 bytes per element. But for a list we have:
the list does not hold the float objects themselves but pointers to them (8 additional bytes per element on a 64-bit build)
a float object needs 24 bytes (see import sys; sys.getsizeof(1.0) returns 24): 8 bytes for reference counting, 8 bytes for a pointer to the float's function table, and 8 bytes to hold the double value itself
because 24 is a multiple of 8, the memory-management overhead is "only" about 0.5 bytes
That means 32.5 bytes for a Python float object vs. 8 bytes for a C double.
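And the same kind of check for floats (again on 64-bit CPython):
import sys
from array import array

n = 1_000_000
floats_as_list = [float(i) for i in range(n)]
floats_as_array = array('d', (float(i) for i in range(n)))

print(sys.getsizeof(1.0))              # 24: refcount + type pointer + the double itself
print(sys.getsizeof(floats_as_list))   # only the pointer storage; the float objects are extra
print(floats_as_array.itemsize)        # 8 bytes per element, stored inline
print(sys.getsizeof(floats_as_array))  # roughly 8 * n plus a small header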
protobuf internally uses the same representation of the data as array.array and thus needs much less memory (about 4-5 times less, as you observe). numpy.array is another example of a data type that holds raw C values and thus needs much less memory than a list.
If one doesn't need to search in a dictionary, then saving the key-value pairs in a list needs less memory than a dictionary, because one doesn't have to maintain a structure for lookups (which imposes some memory cost) - this is another thing that leads to the smaller memory footprint of protobuf data.
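A rough, shallow illustration of that point, comparing only the container overhead (the keys and values themselves are not counted by sys.getsizeof):
import sys

n = 10_000
as_dict = {i: float(i) for i in range(n)}
as_flat_list = [x for i in range(n) for x in (i, float(i))]   # k0, v0, k1, v1, ...

print(sys.getsizeof(as_dict))        # larger: hash table with buckets, hashes, spare slots
print(sys.getsizeof(as_flat_list))   # smaller: just 2*n pointers plus list overhead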
To answer your other question: there is no built-in module that is to Python's dict what array.array is to Python's list, so I'll use this opportunity to shamelessly plug an advertisement for a library of mine: cykhash.
Sets and maps from cykhash need less than 25% of the memory of Python's dict/set but are about as fast.
This is normal, and it's all about the space vs. time tradeoff. Memory layout depends on how a particular data structure is implemented, which in turn depends on how it is going to be used.
A general-purpose dictionary is typically implemented as a hash table. It has a fixed-size array of buckets that store key-value pairs. The number of items in the dictionary can be smaller than, equal to, or bigger than the number of buckets. If it is smaller, space is wasted; if it is bigger, dictionary operations take a long time. A hash table implementation usually starts with a small initial bucket array and grows it as new items are added, to keep performance decent. However, resizing also requires rehashing, which is computationally expensive, so whenever you resize you want to leave some room for growth. General-purpose dictionaries are a trade-off between space and time because they don't "know" how many elements they are supposed to contain and because there is no perfect hash function. In the good-enough case, though, a general-purpose hash table gives you near-O(1) performance.
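You can watch this resizing happen in CPython by checking sys.getsizeof as a dict grows; the exact thresholds and sizes are version-specific:
import sys

d = {}
last = sys.getsizeof(d)
for i in range(100):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:
        print(f"resized at {len(d)} items: {last} -> {size} bytes")
        last = size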
When data is serialized, it's a different story. Data in transit does not change, you are not doing lookups on it, it is not subject to garbage collection, boundary alignment and so on. This means you can simply pack keys and values one after another for space efficiency (see the sketch after the list below). You need virtually no metadata and no control structures as long as the values can be reconstructed. On the downside, manipulating packed data is very slow because every operation takes O(n) time.
For this reason, you will almost always want to:
convert data from time-efficient into space-efficient format before sending it
convert data from space-efficient into time-efficient format after receiving it.
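As a rough illustration of the packed form, here is a sketch using the standard struct module; the (int key, double value) record layout is made up for the example:
import struct
import sys

records = [(i, i * 0.5) for i in range(1000)]   # (int key, float value) pairs

# Time-efficient form: a dict, ready for O(1) lookups.
as_dict = dict(records)

# Space-efficient form: keys and values packed back to back, with no per-object
# headers and no hash table; lookups would require unpacking and scanning.
packed = b"".join(struct.pack("<id", k, v) for k, v in records)

print(sys.getsizeof(as_dict))   # container overhead only; the int/float objects are extra
print(len(packed))              # 12 bytes per record: 4 (int32) + 8 (double)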
If you are using nested dictionaries (or lists, which are in many ways similar), the differences will add up and become even more pronounced. When you know the number of items in advance and the data does not change much, you can probably get some improvement by preallocating the memory for it, such as dict.fromkeys(range(count)).
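A small sketch of that preallocation idea (whether it helps measurably depends on your workload, so benchmark it):
# Build the full key set up front so the hash table is sized once,
# then fill in the real values instead of growing the dict incrementally.
count = 1_000_000
data = dict.fromkeys(range(count))   # all values start as None
for i in range(count):
    data[i] = i * i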
I am trying to run a Python (2.7) script with PyPy but I have encountered the following error:
TypeError: sys.getsizeof() is not implemented on PyPy.
A memory profiler using this function is most likely to give results
inconsistent with reality on PyPy. It would be possible to have
sys.getsizeof() return a number (with enough work), but that may or
may not represent how much memory the object uses. It doesn't even
make really sense to ask how much *one* object uses, in isolation
with the rest of the system. For example, instances have maps,
which are often shared across many instances; in this case the maps
would probably be ignored by an implementation of sys.getsizeof(),
but their overhead is important in some cases if they are many
instances with unique maps. Conversely, equal strings may share
their internal string data even if they are different objects---or
empty containers may share parts of their internals as long as they
are empty. Even stranger, some lists create objects as you read
them; if you try to estimate the size in memory of range(10**6) as
the sum of all items' size, that operation will by itself create one
million integer objects that never existed in the first place.
Now, I really need to check the size of one nested dict during the execution of the program. Is there any alternative to sys.getsizeof() I can use in PyPy? If not, how would I check the size of a nested object in PyPy?
Alternatively, you can gauge the memory usage of your process using:
import resource

# Peak resident set size of the process so far; the units are platform-dependent
# (kilobytes on Linux, bytes on macOS).
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
As your program executes, getrusage will give the total memory consumption of the process (in bytes or kilobytes, depending on the platform). Using this information you can estimate the size of your data structures, and if you start using, say, 50% of your machine's total memory, then you can do something to handle it.
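A hedged sketch of that kind of check, assuming Linux (where ru_maxrss is reported in kilobytes and os.sysconf exposes the physical memory size):
import os
import resource

def memory_fraction_used():
    # Peak RSS of this process; on Linux ru_maxrss is in kilobytes.
    used_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return used_bytes / total_bytes

if memory_fraction_used() > 0.5:
    pass  # e.g. flush intermediate results to disk, shrink caches, etc.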
I'm sorry if this has been asked before, but I'm not really sure what I'm looking for and I lack the domain knowledge to correctly frame the question, which makes answers rather hard to find!
Anyway, I am trying to implement in Python a simulated annealing algorithm from a paper (IBM J. Res. Dev., 2001; 45(3/4); 545). The author gives a clear outline of the algorithm, which he has implemented in C++; however, at the end of his definition he states the following:
"To avoid repeated and potentially expensive memory allocation S and S* are implemented as a single object that is able to revert to its initial state after an unfavourable mutation."
(S and S* represent the original and step changed state of whatever is being optimised).
In a previous, more naive version I used two lists to hold each state, but his comment seems to suggest that such an approach is memory-inefficient. Thus, my questions are:
Is his comment C++ specific and in Python I can continue to use lists and not worry about it?
If I do need to worry about it, what Python data structure should I use? Simply define a class with original and mutated attributes and a method to do the mutation or is there something else I'm missing?
I still need the two states, so does wrapping that in a class change the way the memory is allocated to make the class representation more compact?
You could theoretically log and replay all manipulations of a mutable Python object to recover its state prior to the misguided mutation. But that's painful and expensive. So is mapping logged events to inverse functions in such a way that a mutation can be reverted.
The trend with functional programming, however, seems to strongly encourage expanded use of immutable data, especially in concurrent programming. This mindset isn't limited to functional languages like Haskell, OCaml, and Erlang. It's even infiltrated the Java world:
An object is considered immutable if its state cannot change after it
is constructed. Maximum reliance on immutable objects is widely
accepted as a sound strategy for creating simple, reliable code.
Immutable objects are particularly useful in concurrent applications.
Since they cannot change state, they cannot be corrupted by thread
interference or observed in an inconsistent state.
Programmers are often reluctant to employ immutable objects, because
they worry about the cost of creating a new object as opposed to
updating an object in place. The impact of object creation is often
overestimated, and can be offset by some of the efficiencies
associated with immutable objects. These include decreased overhead
due to garbage collection, and the elimination of code needed to
protect mutable objects from corruption.
The following subsections take a class whose instances are mutable and
derives a class with immutable instances from it. In so doing, they
give general rules for this kind of conversion and demonstrate some of
the advantages of immutable objects.
Generate a new list with a map or list comprehension and modify accordingly. If RAM really is an issue, consider using a generator, which gives you an iterable with the desired modification and a lower memory footprint.
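For instance, a toy mutation just to show the shape of the two approaches:
states = [0.5, 1.2, 3.4]

# Eager: materialises a second list alongside the original.
mutated_list = [x * 1.01 for x in states]

# Lazy: a generator yields mutated values one at a time, so only one element
# of the "new" state exists in memory at any moment.
mutated_gen = (x * 1.01 for x in states)
for value in mutated_gen:
    pass  # evaluate the candidate state element by element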
It depends on the object being created. If it is a large object (in memory), then creating two of them, S and S', as lists or otherwise would be less memory-efficient than designing a way to transform S -> S' and vice versa, as long as the transformation itself does not consume much memory.
These types of data structures are classified as retroactive data structures, with a few other synonyms like kinetic data structures.
Using a deque and managing all the object states is one way of doing it.
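Here is a minimal sketch of that single-object idea, assuming the state is a flat list of floats and a mutation touches individual positions; the class and method names are made up for the example:
from collections import deque

class AnnealingState:
    """Single state object that can undo its most recent mutation in place."""

    def __init__(self, values):
        self.values = list(values)
        self._undo = deque()            # (index, previous_value) records

    def mutate(self, index, new_value):
        self._undo.append((index, self.values[index]))
        self.values[index] = new_value

    def accept(self):
        self._undo.clear()              # keep the mutation, forget the undo log

    def revert(self):
        while self._undo:               # roll back the unfavourable mutation
            index, old = self._undo.pop()
            self.values[index] = old

state = AnnealingState([0.0, 1.0, 2.0])
state.mutate(1, 5.0)
state.revert()                          # back to [0.0, 1.0, 2.0] without copying the list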
What is the main advantage of using the array module instead of lists?
The arrays will take less space.
I've never used the array module; numpy provides the same benefits plus many, many more.
Arrays are very similar to lists "except that the type of objects stored in them is constrained. The type is specified at object creation time by using a type code, which is a single character."
http://docs.python.org/library/array.html
So if you know your array will only contain objects of a certain type, then go ahead (if performance is crucial); if not, just use a list.
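For example (the type codes are documented on the page linked above):
from array import array

counts = array('i', [1, 2, 3])   # 'i' constrains elements to C ints
counts.append(4)                 # fine
try:
    counts.append(2.5)           # rejected: not an int
except TypeError as exc:
    print(exc)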
Arrays can be very useful if you want to strictly type the data in your collection. Performance aside, it can be quite convenient to be sure of the data types contained within your data structure. However, arrays don't feel very 'pythonic' to me (though I must stress this is only personal preference). Unless you are really concerned with the type of data within the collection, I would stick with lists, as they afford you a great deal of flexibility. The very slight memory optimisations gained between lists and arrays are insignificant unless you have an extremely large amount of data.