Object Creation in Python - What happens to the modified variable? - python

After getting in touch with the deeper workings of Python, I've come to understand that assigning a variable creates a new object with its own specific address, regardless of the variable name that was bound to the object.
In that case, though, I wonder what happens to an object that was created and later modified. Does it just sit there and consume memory?
The scenario I have in mind looks something like this:
# Creates object with id 10101001 (random numbers)
x = 5
# Creates object with id 10010010 using the value from object 10101001.
x += 10
What happens to object with the id 10101001?
Out of curiosity, why do objects need both an ID and a reference (the variable name)? Wouldn't it be better to just associate the address with the variable name?
I apologize in advance for the cringe this question might invoke in someone.

Here is a great talk that was given at PyCon by Ned Batchelder this year about how Python manages variables.
https://www.youtube.com/watch?v=_AEJHKGk9ns
I think it will help clear up some of your confusion.

First of all, the documentation on augmented assignment statements states:
An augmented assignment expression like x += 1 can be rewritten x = x + 1 to achieve a similar, but not exactly equal effect. In the augmented version, x is only evaluated once. Also, when possible, the actual operation is performed in-place, meaning that rather than creating a new object and assigning that to the target, the old object is modified instead.
So depending on the type of x this might not create a new object.
CPython is reference counted, so the reference count of the object with id 10101001 is decremented. If this count hits zero, the object is freed almost immediately. Small integers (in CPython, -5 through 256) are cached anyway, though. Refer to Objects, Types and Reference Counts for all the details.
Regarding the id of an object:
CPython implementation detail: This is the address of the object in memory.
So basically id and reference are the same. The variable name is just a binding to the object itself.
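The difference between rebinding and in-place modification can be checked directly. This is a minimal sketch, assuming CPython; the names old_int, nums and alias are illustrative and do not come from the question.

```python
# Immutable int: += cannot change the object, so the name is rebound.
x = 5
old_int = x                      # keep a second reference to the original object
x += 10
int_rebound = x is not old_int   # x now names a different object (15)
int_original_intact = old_int == 5

# Mutable list: += is performed in place, so the object stays the same.
nums = [1, 2]
alias = nums                     # second name for the same list
nums += [3]
list_same_object = nums is alias

print(int_rebound, int_original_intact, list_same_object, alias)
```

Because alias and nums are the same list, the appended element is visible through both names, while the integer 5 is untouched by x += 10.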

Related

Garbage collection for a simple python class

I am writing a python class like this:
class MyImageProcessor:
    def __init__(self, image, metadata):
        self.image = image
        self.metadata = metadata
Both image and metadata are objects of a class written by a colleague. Now I need to make sure that no memory is wasted. I am thinking of defining a quit() method like this,
def quit(self):
    self.image = None
    self.metadata = None
    import gc
    gc.collect()
and suggesting that users call quit() systematically. I would like to know whether this is the right way. In particular, do the instructions in quit() above guarantee that unused memory is properly collected?
Alternatively, I could rename quit() to the built-in __exit__(), and suggest that users use the "with" syntax. But my question is more about whether the instructions in quit() indeed fulfill the garbage collection work one would need in this situation.
Thank you for your help.
In Python every object has a built-in reference count; the variables (names) you create are only pointers to the objects. There are mutable and immutable objects (for example, if you change the value of an integer, the name will point to another integer object, while changing a list element does not rebind the list's name).
The reference count basically tracks how many references point to that object, and it is incremented and decremented automatically.
The garbage collector will destroy the objects with zero references (actually not always, it takes extra steps to save time). You should check out this article.
Similarly to object constructors (__init__()), which are called on object creation, you can define destructors (__del__()), which are executed on object deletion (usually when the reference count drops to 0). According to this article, in Python they are not needed as much as in C++ because Python has a garbage collector that handles memory management automatically. You can check out those examples too.
Hope it helps :)
No need for quit() (Assuming you're using C-based python).
Python uses two methods of garbage collection, as alluded to in the other answers.
First, there's reference counting. Essentially, each time you add a reference to an object its count is incremented, and each time you remove a reference (e.g., it goes out of scope) the count is decremented.
From https://devguide.python.org/garbage_collector/:
When an object’s reference count becomes zero, the object is deallocated. If it contains references to other objects, their reference counts are decremented. Those other objects may be deallocated in turn, if this decrement makes their reference count become zero, and so on.
You can get information about current reference counts for an object using sys.getrefcount(x), but really, why bother.
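For the curious, here is a small sketch of how the count moves as names are bound and removed. It assumes CPython; the absolute numbers are an implementation detail (the call itself may add a temporary reference), so only the differences are meaningful. The names payload and alias are illustrative.

```python
import sys

payload = object()                  # a fresh object with one name bound to it
before = sys.getrefcount(payload)   # absolute value is implementation-specific

alias = payload                     # bind a second name to the same object
after_bind = sys.getrefcount(payload)

del alias                           # remove that name again
after_del = sys.getrefcount(payload)

print(after_bind - before, after_del - before)
```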
The second way is through garbage collection (gc). [Reference counting is a type of garbage collection, but Python specifically calls this second method "garbage collection", so we'll also use this terminology.] This is intended to find those places where the reference count is not zero but the object is no longer accessible ("reference cycles"). For example:
class MyObj:
    pass

x = MyObj()
x.self = x
Here, x refers to itself, so the actual reference count for x is more than 1. You can call del x but that merely removes it from your scope: it lives on because "someone" still has a reference to it.
gc, and specifically gc.collect() goes through objects looking for cycles like this and, when it finds an unreachable cycle (such as your x post deletion), it will deallocate the whole lot.
Back to your question: You don't need a quit() method, because as soon as your MyImageProcessor object goes out of scope, it will decrement the reference counts for image and metadata. If that puts them at zero, they're deallocated. If it doesn't, well, someone else is using them.
Setting them to None first merely decrements the reference counts right then, but when MyImageProcessor goes out of scope, it won't decrement those reference counts again, because MyImageProcessor no longer holds the image or metadata objects! So you're just explicitly doing what Python already does for you for free: no more, no less.
You didn't create a cycle, so calling gc.collect() is unlikely to change anything.
Check out https://devguide.python.org/garbage_collector/ if you are interested in more earthy details.
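The two mechanisms can be observed with weak references, which do not keep an object alive. This is a sketch assuming CPython (other implementations may free objects later); the Node class and variable names are hypothetical. The automatic collector is switched off for the demo so that only the explicit gc.collect() call can break the cycle.

```python
import gc
import weakref

class Node:
    """Hypothetical class used only for this demonstration."""

gc.disable()                      # keep the automatic cycle collector out of the way

# Case 1: no cycle, so reference counting alone frees the object.
plain = Node()
plain_ref = weakref.ref(plain)
del plain
freed_by_refcount = plain_ref() is None

# Case 2: the object refers to itself, so its count never reaches zero.
cyclic = Node()
cyclic.me = cyclic
cyclic_ref = weakref.ref(cyclic)
del cyclic
alive_after_del = cyclic_ref() is not None

unreachable = gc.collect()        # the cycle detector finds and frees it
freed_by_gc = cyclic_ref() is None

gc.enable()
print(freed_by_refcount, alive_after_del, freed_by_gc, unreachable)
```

For the MyImageProcessor class in the question, only the first case applies, which is why an explicit gc.collect() adds nothing there.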
Not sure if it makes sense, but by my logic you could call gc.get_count() before and after gc.collect() to see if something has been removed.
what are count0, count1 and count2 values returned by the Python gc.get_count()
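A rough sketch of that idea, assuming CPython. The Link class and make_cycle() helper are hypothetical. Note that gc.get_count() reports the collector's internal per-generation counters, whose exact meaning varies between versions, while the return value of gc.collect() directly gives the number of unreachable objects found, which is usually the easier figure to interpret.

```python
import gc

class Link:
    """Hypothetical class used only to build a reference cycle."""

def make_cycle():
    first, second = Link(), Link()
    first.other, second.other = second, first   # two objects referring to each other

gc.collect()                     # start from a clean slate
make_cycle()                     # the cycle is unreachable once the function returns
counts_before = gc.get_count()
found = gc.collect()             # number of unreachable objects found
counts_after = gc.get_count()

print(counts_before, counts_after, found)
```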

Python Assignment or Variable binding?

x = 3
x = 4
Is the second line an assignment statement or a new variable binding?
In (very basic) C terms, when you assign a new value to a variable this is what happens:
x = malloc(some object struct)
If I'm interpreting your question correctly, you're asking what happens when you reassign x - this:
A. *x = some other value
or this:
B. x = malloc(something else)
The correct answer is B, because the object the variable points to can be also referred to somewhere else and changing it might affect other parts of the program in an unpredictable way. Therefore, Python unbinds the variable name from the old structure (decreasing its "reference counter"), allocates a new structure and binds the name to this new one. Once reference counter of a structure becomes zero, it becomes garbage and will be freed at some point.
Of course, this workflow is highly optimized internally, and details may vary depending on the object itself, specific interpreter (CPython, Jython etc) and from version to version. As userland python programmers, we only have a guarantee that
x = old_object
and then
x = new_object
doesn't affect "old_object" in any way.
There is no difference. Assigning to a name in Python is the same whether or not the name already existed.
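The guarantee mentioned above is easy to confirm. A minimal sketch, reusing the old_object and new_object names from the answer with made-up list values:

```python
old_object = [1, 2, 3]
new_object = ["a", "b"]

x = old_object        # x and old_object name the same list
x = new_object        # rebinding x leaves the first list untouched

print(old_object)
print(x is new_object, x is old_object)
```

Rebinding x only changes which object the name refers to, so old_object still holds its original three elements.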

Why should I refer to "names" and "binding" in Python instead of "variables" and "assignment"?

Why should I refer to "names" and "binding" in Python instead of "variables" and "assignment"?
I know this question is a bit general but I really would like to know :)
In C and C++, a variable is a named memory location. The value of the variable is the value stored in that location. Assign to the variable and you modify that value. So the variable is the memory location, not the name for it.
In Python, a variable is a name used to refer to an object. The value of the variable is that object. So far sounds like the same thing. But assign to the variable and you don't modify the object itself, rather you alter which object the variable refers to. So the variable is the name, not the object.
For this reason, if you're considering the properties of Python in the abstract, or if you're talking about multiple languages at once, then it's useful to use different names for these two different things. To keep things straight you might avoid talking about variables in Python, and refer to what the assignment operator does as "binding" rather than "assignment".
Note that The Python grammar talks about "assignments" as a kind of statement, not "bindings". At least some of the Python documentation calls names variables. So in the context of Python alone, it's not incorrect to do the same. Different definitions for jargon words apply in different contexts.
In, for example, C, a variable is a location in memory identified by a specific name. For example, int i; means that there is a 4-byte (usually) variable identified by i. This memory location is allocated regardless of whether a value is assigned to it yet. When C runs i = 1000, it is changing the value stored in the memory location i to 1000.
In Python, the memory location and size are irrelevant to the interpreter. The closest Python comes to a "variable" in the C sense is a value (e.g. 1000) which exists as an object somewhere in memory, with or without a name attached. Binding it to a name happens by i = 1000. This tells Python to create an integer object with a value of 1000, if it does not already exist, and bind it to the name 'i'. An object can be bound to multiple names quite easily, e.g.:
>>> a = [] # Create a new list object and bind it to the name 'a'
>>> b = a # Get the object bound to the name 'a' and bind it to the name 'b'
>>> a is b # Are the names 'a' and 'b' bound to the same object?
True
This explains the difference between the terms, but as long as you understand the difference it doesn't really matter which you use. Unless you're pedantic.
I'm not sure the name/binding description is the easiest to understand; I've always been confused by it, even though I have a somewhat accurate understanding of how Python (and CPython in particular) works.
The simplest way to describe how Python works if you're coming from a C background is to understand that all variables in Python are indeed pointers to objects and for example that a list object is indeed an array of pointers to values. After a = b both a and b are pointing to the same object.
There are a couple of tricky parts where this simple model of Python semantics seems to fail, for example with the list augmented operator +=. For that, it's important to note that a += b in Python is not the same as a = a + b; it is a special increment operation (which can also be defined for user types with the __iadd__ method; a += b is indeed a = a.__iadd__(b)).
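A sketch of how that dispatch plays out for a user-defined type. The Basket class is hypothetical and exists only to make the difference between __add__ and __iadd__ visible through a second name.

```python
class Basket:
    """Hypothetical container used to show how += is dispatched."""

    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        # a + b builds and returns a brand-new Basket
        return Basket(self.items + other.items)

    def __iadd__(self, other):
        # a += b mutates self and returns it, so the name keeps its object
        self.items.extend(other.items)
        return self

a = Basket(["apple"])
alias = a
a += Basket(["pear"])          # in place: alias sees the change
in_place_same = a is alias

a = a + Basket(["plum"])       # new object: alias is left behind
plus_is_new = a is not alias

print(in_place_same, plus_is_new, alias.items, a.items)
```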
Another important thing to understand is that while all variables in Python are indeed pointers, there is still no pointer concept in the language. In other words, you cannot pass a "pointer to a variable" to a function so that the function can change the variable: what in C++ is defined by
void increment(int &x) {
    x += 1;
}
or in C by
void increment(int *x) {
    *x += 1;
}
in Python cannot be defined because there's no way to pass "a variable", you can only pass "values". The only way to pass a generic writable place in Python is to use a callback closure.
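One possible reading of the "callback closure" approach, as a sketch: the caller hands over a getter and a setter that close over its own variable. The function and variable names here are assumptions made for illustration.

```python
def increment(getter, setter):
    """Increment whatever 'place' the two callbacks give access to."""
    setter(getter() + 1)

def demo():
    counter = 0

    def get_counter():
        return counter

    def set_counter(value):
        nonlocal counter      # rebind the enclosing function's variable
        counter = value

    increment(get_counter, set_counter)
    return counter

result = demo()
print(result)
```

The increment() function never sees the variable itself, only the two callables, which is as close as Python gets to the C and C++ versions above.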
Who said you should? Unless you are discussing issues that are directly related to name binding operations, it is perfectly fine to talk about variables and assignments in Python as in any other language. Naturally the precise meaning is different in different programming languages.
If you are debugging an issue connected with "Naming and binding" then use this terminology because Python language reference uses it: to be as specific and precise as possible, to help resolve the problem by avoiding unnecessary ambiguity.
On the other hand, if you want to know what is the difference between variables in C and Python then these pictures might help.
I would say that the distinction is significant because of several of the differences between C and Python:
Duck typing: a C variable is always an instance of a given type. In Python it isn't; the type of the object that a name refers to can change.
Shallow copies - Try the following:
>>> a = [4, 5, 6]
>>> b = a
>>> b[1] = 0
>>> a
[4, 0, 6]
>>> b = 3
>>> a
[4, 0, 6]
This makes sense as a and b are both names that spend some of the time bound to a list instance rather than being separate variables.

Why do Python variables take a new address (id) every time they're modified?

Just wondering what the logic behind this one is? On the surface it seems kind of inefficient, that every time you do something simple like "x=x+1" that it has to take a new address and discard the old one.
The Python variable (called an identifier or name, in Python) is a reference to a value. The id() function tells you something about that value, not the name.
Many values are not mutable; integers, strings, floats all do not change in place. When you add 1 to another integer, you return a new integer that then replaces the reference to the old value.
You can look at Python names as labels, tied to values. If you imagine values as balloons, you are retying the label to a new balloon each time you assign to that name. If there are no other labels attached to a balloon anymore, it simply drifts away in the wind, never to be seen again. The id() function gives you a unique number for that balloon.
See this previous answer of mine where I talk a little bit more about that idea of values-as-balloons.
This may seem inefficient. For many often-used and small values, Python actually uses a process called interning, where it will cache a stash of these values for re-use. None is such a value, as are small integers and the empty tuple (()). You can use the intern() function (sys.intern() in Python 3) to do the same with strings you expect to use a lot.
But note that values are only cleaned up when their reference count (the number of 'labels') drops to 0. Loads of values are reused all over the place all the time, especially those interned integers and singletons.
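A small sketch of explicit string interning, assuming Python 3 (where the function lives in the sys module). The strings are built at run time so that the compiler cannot merge them as constants; the variable names are illustrative.

```python
import sys

# Build the strings at run time so they start out as separately created objects.
first = "".join(["garbage", " ", "collector"])
second = "".join(["garbage", " ", "collector"])
equal_values = first == second

# sys.intern returns the one canonical object for a given string value.
interned_first = sys.intern(first)
interned_second = sys.intern(second)
same_object = interned_first is interned_second

print(equal_values, same_object)
```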
Because the basic types are immutable, every time you modify one a new object has to be instantiated
...which is perfectly fine, especially for thread-safe functions
The = operator doesn't modify an object, it assigns the name to a completely different object, which may or may not already have an id.
For your example, integers are immutable; there's no way to add something to one and keep the same id.
And, in fact, small integers are interned, at least in CPython, so if you do:
x = 1
y = 2
x = x + 1
Then x and y may have the same id.
In Python "primitive" types like ints and strings are immutable, which means they cannot be modified.
Python is actually quite efficient, because, as @Wooble commented, «Very short strings and small integers are interned.»: if two variables reference the same (small) immutable value their id is the same (reducing duplicated immutables).
>>> a = 42
>>> b = 5
>>> id(a) == id(b)
False
>>> b += 37
>>> id(a) == id(b)
True
The reason behind the use of immutable types is a safe approach to the concurrent access on those values.
At the end of the day it depends on a design choice.
Depending on your needs you can take more advantage of an implementation instead of another.
For instance, a different philosophy can be found in a somewhat similar language, Ruby, where those types that in Python are immutable, are not.
To be accurate, assignment x=x+1 doesn't modify the object that x is referencing, it just lets the x point to another object whose value is x+1.
To understand the logic behind, one needs to understand the difference between value semantics and reference semantics.
For an object with value semantics, only its value matters, not its identity, while an object with reference semantics is defined by its identity (in Python, the identity is returned by id(obj)).
Typically, value semantics implies immutability of the object. Conversely, if an object is mutable (i.e. it can be changed in place), that means it has reference semantics.
Let's briefly explain the rationale behind this immutability.
Objects with reference semantics can be changed in-place without losing their original addresses/identities. This makes sense in that it's the identity of an object with reference semantics that makes itself distinguishable from other objects.
In contrast, an object with value-semantics should never change itself.
First, this is possible and reasonable in theory. Since only the value (not the identity) is significant, when a change is needed it is safe to swap the object for another one with a different value. This is called referential transparency. Note that this is impossible for objects with reference semantics.
Secondly, this is beneficial in practice. As the OP thought, it seems inefficient to discard the old object each time it is changed, but most of the time it is more efficient than the alternative. For one thing, Python (like many other languages) has an intern/cache scheme so that fewer objects are created. What's more, if objects with value semantics were designed to be mutable, they would take much more space in most cases.
For example, Date has value semantics. If it were designed to be mutable, any method returning a date from an internal field would expose a handle to the outside world, which is risky (e.g. outside code could directly modify this internal field without going through the public interface). Similarly, if one passes a date object by reference to some function or method, the object could be modified there, which may not be what the caller expects. To avoid these kinds of side effects, one has to program defensively: instead of directly returning the inner date field, one returns a clone of it; instead of passing by reference, one passes by value, which means extra copies are made. As one can imagine, this tends to create more objects than necessary. What's worse, the code becomes more complicated with all the extra cloning.
In a word, immutability enforces value semantics; it usually involves less object creation, fewer side effects and less hassle, and is more test-friendly. Besides, immutable objects are inherently thread-safe, which means fewer locks and better efficiency in a multithreaded environment.
That's the reason why basic data types with value semantics like numbers, strings, dates and times are all immutable (well, string in C++ is an exception, which is why there are so many const string& parameters, to avoid strings being modified unexpectedly). As a lesson, Java made the mistake of designing the value-semantic classes Date, Point, Rectangle and Dimension as mutable.
As we know, objects in OOP have three characteristics: state, behavior and identity. Objects with value semantics are not typical objects in that their identities do not matter at all. Usually they are passive, and mostly used to describe other real, active objects(i.e. those with reference semantics). This is a good hint to distinguish between value semantics and reference semantics.
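The point about defensive copying can be illustrated in Python, where datetime.date is immutable. This is a sketch only; the Project class and its deadline field are hypothetical. Because a date cannot be changed in place, the internal field can be handed out directly without cloning.

```python
from datetime import date

class Project:
    """Hypothetical class that exposes an internal date field."""

    def __init__(self, deadline):
        self._deadline = deadline

    @property
    def deadline(self):
        return self._deadline     # safe to hand out: date objects are immutable

project = Project(date(2024, 1, 31))
handed_out = project.deadline
postponed = handed_out.replace(month=2, day=29)   # returns a new date object

print(project.deadline, postponed)
```

The caller ends up with a new date, while the project's own deadline is unchanged, and no copy had to be made when the field was returned.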

Python immutable object from within function

I asked a previous question on stackoverflow here: Python immutable types in function calls
which made it clear that only references to immutable objects are passed to functions, and so passing a tuple to a function does not result in a full memory copy of that object.
However, according to: http://www.testingreflections.com/node/view/5126
"Some objects, like strings, tuples, and numbers, are immutable. Altering them inside a function/method will create a new instance and the original instance outside the function/method is not changed."
I wrote some test code, where an immutable object is passed to a function. As expected, I can modify the object via the parameter-name/reference defined as part of the function header, and all changes only persist within the called function, leaving the original object outside of the function untouched.
So my question is:
Is the new instance created only when an attempt is made to alter/modify the object passed in? I'm guessing that if the object is not changed, a reference to it is all that is required. More importantly, if it does create a copy upon attempted modification, how does python manage the memory? With a zero-copy/copy-on-write, or does it create a complete replicated object (with the whole object's size reserved in memory) visible only within the called function?
You will think a lot more clearly about variables in Python if you think of them not as boxes that contain values, but names that are attached to objects. Any object can have any number of names attached to it; some of the names are local to functions and will be taken off the object automatically when the function returns.
So when you do something like this:
name = "Slartibartfast"
person = name
There is a string object, which contains the text "Slartibartfast", and there are two names by which it can be referred: name and person. You get the same object in both cases; you can verify this with the id() function.
Which is the "real" name of the string, name or person? This is a trick question. The string does not inherently have a name; it is just a string. name is not a box that contains "Slartibartfast", it is just an identifier that refers to the object. person has exactly the same standing in Python; name is not "more important" just because it was assigned first.
NOTE: Some objects, such as functions and classes, have a __name__ attribute that holds the name that was used to declare it in the def or class statement. This is the object's "real name" if it can be said to have one. However, you can still reference it through any number of assigned names.
Now, suppose you "modify" the string to give it a bit more of a Dutch flavor:
person = person.replace("art", "aart")
"Modify" is in quotes because you can't modify a string. Since a string is immutable, every string operation creates a new string. When does it happen? Immediately. In this case, the new string "Slaartibaartfast" is created and the name person is adjusted to refer to that. However, the name name still refers to the original string, because you haven't told it to refer to anything else. As long as at least one name refers to it, Python will keep good old "Slartibartfast" around.
This is no different when dealing with a function:
def dutchnametag(name):
    name = name.replace("art", "aart")
    print("HELLO! My Dutch name is", name)

person = "Slartibartfast"
dutchnametag(person)
Here we assign the string "Slartibartfast" to the global name person. We then pass that string to our function, where it receives the additional local name name. The string's replace() method is then called through the name identifier, creating a new string. The identifier name is then reassigned to point to the new string. Outside the function, the global identifier person still refers to the original string, because nothing has changed it to point to anything else.
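To check that no copy is made when the string is passed in, the function can report the identity of the object it received. This is a sketch, written for Python 3; dutch_name is a hypothetical variant of the function above that returns its results instead of printing them.

```python
def dutch_name(name):
    received_id = id(name)               # identity of the object as passed in
    name = name.replace("art", "aart")   # builds a new string, rebinds the local name
    return received_id, name

person = "Slartibartfast"
received_id, dutch = dutch_name(person)

no_copy_on_call = received_id == id(person)   # the function saw the very same object
print(no_copy_on_call, person, dutch)
```

The new string is created only at the replace() call, and it is a complete, separate object; the original is never copied or altered.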
I'm not speaking about python per se. But generally, in immutable data structures, every method that you use that needs to change state will return a new object (with the modified state). The old one will remain the same.
For example, a Java mutable list could have:
void addItem(Object item) { ... }
the corresponding immutable list would have a method along the lines of
List addItem(Object item) { ... }
So, there is generally nothing special about immutable data structures. In any language you may create immutable data structures. But some languages make it hard or impossible to create mutable data structures (generally, functional languages).
Some languages may provide pseudo-immutable data structures. They make some special data structures look immutable to the coder, while in fact they aren't.
If an object is immutable there is no way to change it. You could assign a new object to the name formerly associated with the argument object. To do this you first need to make a new object. So yes, you would allocate space for a complete new object.
