I am attempting to implement an algorithm (in Python) which involves a growing forest. The number of nodes is fixed, and at each step an edge is added. Throughout the course of the algorithm I need to keep track of the roots of the trees. This is a fairly common problem, e.g. in Kruskal's algorithm. Naively one might compute the root on the fly, but my forest is too large to make this feasible. A second attempt might be to keep a dictionary keyed by the nodes, whose values are the roots of the trees containing those nodes. This seems more promising, but I need to avoid updating the dictionary value of every node when two trees are merged (the trees eventually get very deep, and this is too computationally expensive). I was hopeful when I found the topic:
Simulating Pointers in Python
The notion was to keep a pointer to the root of each tree and simply update the roots when trees were merged. However, I quickly ran into the following (undesirable) behavior:
class ref:
    def __init__(self, obj): self.obj = obj
    def get(self): return self.obj
    def set(self, obj): self.obj = obj

a = ref(1)
b = ref(2)
c = ref(3)
a = b
b = c
print(a.get(), b.get(), c.get())  # => 2 3 3
Of course the desired output would be 3 3 3. If I check the addresses at each step I find that a and b are indeed pointing to the same thing (after a = b), but that a is not updated when I set b = c.
a = ref(1)
b = ref(2)
c = ref(3)
print(id(a),id(b),id(c)) # => 140512500114712 140512500114768 140512500114824
a = b
print(id(a),id(b),id(c)) # => 140512500114768 140512500114768 140512500114824
b = c
print(id(a),id(b),id(c)) # => 140512500114768 140512500114824 140512500114824
My primary concern is to be able to track the roots of trees when they are merged without a costly update; I would take any reasonable solution on this front, whether or not it relates to the ref class. My secondary concern is to understand why Python is behaving this way with the ref class and how to modify the class to get the desired behavior. Any help or insight with regard to these problems is greatly appreciated.
When a = b is executed, Python simply rebinds the name a to the object that b currently refers to; the two names are not linked to each other. So when b = c runs afterwards, only the name b is rebound, and a still refers to the ref holding 2. Assigning to a bare name never mutates an object.
If you instead mutated the shared object, e.g. a.set(b.get()), or arranged for both names to refer to one ref object and called set() on it, the update would be visible through every name bound to that object. (I hope!)
Let me know if this works and if anything needs more clarification.
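For the primary concern, the standard structure is a disjoint set (union-find), which is what Kruskal's algorithm uses: each node points (possibly indirectly) at its tree's root, and merging two trees just re-points one root at the other. With path compression and union by rank, find and union run in near-constant amortized time, so no merge ever touches every node. A minimal sketch (the class and method names are my own, not from the question):

class DisjointSet:
    def __init__(self, nodes):
        # Every node starts as the root of its own one-node tree.
        self.parent = {node: node for node in nodes}
        self.rank = {node: 0 for node in nodes}

    def find(self, node):
        # Walk up to the root, then flatten the path behind us.
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

For example, after ds = DisjointSet(range(5)); ds.union(0, 1); ds.union(1, 2), the test ds.find(2) == ds.find(0) is True, and neither union rewrote every node's root.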
I was playing around in python. I used the following code in IDLE:
p = [1, 2]
p[1:1] = [p]
print p
The output was:
[1, [...], 2]
What is this […]? Interestingly I could now use this as a list of list of list up to infinity i.e.
p[1][1][1]....
I could write the above as long as I wanted and it would still work.
EDIT:
How is it represented in memory?
What's its use? Examples of some cases where it is useful would be helpful.
Any link to official documentation would be really useful.
This is what your code created: a list where the first and last elements point to two numbers (1 and 2) and where the middle element points to the list itself.
In Common Lisp when printing circular structures is enabled such an object would be printed as
#1=#(1 #1# 2)
meaning that there is an object (labelled 1 with #1=) that is a vector with three elements, the second being the object itself (back-referenced with #1#).
In Python instead you just get the information that the structure is circular with [...].
In this specific case the description is not ambiguous (the back-reference points to a list, and there is only one list, so it must be that one). In other cases it may however be ambiguous... for example in
[1, [2, [...], 3]]
the backward reference could either point to the outer or to the inner list.
These two different structures printed in the same way can be created with
x = [1, [2, 3]]
x[1][1:1] = [x[1]]
y = [1, [2, 3]]
y[1][1:1] = [y]
print(x)
print(y)
and they differ in memory: x's back-reference points to the inner list x[1], while y's back-reference points to the outer list y.
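Since the printed forms are identical, one way to tell the two structures apart at runtime is to test object identity rather than rely on the output:

print(x[1][1] is x[1])  # True: x's back-reference points to the inner list
print(y[1][1] is y)     # True: y's back-reference points to the outer list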
It means that you created an infinite list nested inside itself, which cannot be printed in full. p contains p, which contains p... and so on. The [...] notation is a way to let you know this, and to inform you that it can't be fully represented! Take a look at #6502's answer to see a nice picture showing what's happening.
Now, regarding the three new items after your edit:
This answer seems to cover it
Ignacio's link describes some possible uses
This is more a topic of data structure design than programming languages, so it's unlikely that any reference is found in Python's official documentation
To the question "What's its use", here is a concrete example.
Graph reduction is an evaluation strategy sometimes used to interpret a computer language. It is a common strategy for lazy evaluation, notably in functional languages.
The starting point is to build a graph representing the sequence of "steps" the program will take. Depending on the control structures used in that program, this may lead to a cyclic graph (because the program contains some kind of "forever" loop, or uses recursion whose "depth" will be known at evaluation time but not at graph-creation time)...
In order to represent such a graph, you need infinite "data structures" (sometimes called recursive data structures), like the one you noticed. Usually a little more complex, though.
If you are interested in that topic, here is (among many others) a lecture on that subject: http://undergraduate.csse.uwa.edu.au/units/CITS3211/lectureNotes/14.pdf
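As a toy illustration (my own example, not from the lecture notes), an unconditional loop in such a step graph can be represented by a node whose successor is itself:

step = {"op": "do_work"}
step["next"] = step   # the step's successor is the step itself: a cycle
print(step)           # {'op': 'do_work', 'next': {...}}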
We do this all the time in object-oriented programming. If any two objects refer to each other, directly or indirectly, they are both infinitely recursive structures (or both part of the same infinitely recursive structure, depending on how you look at it). That's why you don't see this much in something as primitive as a list -- because we're usually better off describing the concept as interconnected "objects" than an "infinite list".
You can also get ... with an infinitely recursive dictionary. Let's say you want a dictionary of the corners of a triangle, where each value is a dictionary of the other corners connected to that corner. You could set it up like this:
a = {}
b = {}
c = {}
triangle = {"a": a, "b": b, "c": c}
a["b"] = b
a["c"] = c
b["a"] = a
b["c"] = c
c["a"] = a
c["b"] = b
Now if you print triangle (or a or b or c for that matter), you'll see it's full of {...} because any two corners are referring back to each other.
As I understand it, this is an example of a fixed point:
p = [1, 2]
p[1:1] = [p]
f = lambda x: x[1]
f(p) == p      # True
f(f(p)) == p   # True
I have seen multiple representations of the adjacency list of a graph and I do not know which one to use.
I am thinking of the following representation of a Node object and a Graph object (as below):
class Node(object):
    def __init__(self, val):
        self.val = val
        # key = neighbor node : value = distance
        self.connections_distance = {}

    def add(self, neighborNode, distance):
        if neighborNode not in self.connections_distance:
            self.connections_distance[neighborNode] = distance


class Graph(object):
    def __init__(self):
        # key = node.val : value = node object
        self.nodes = {}

    # multiple methods
The second way: nodes are labelled 0 to n-1 (where n is the number of nodes), and adjacency is stored as an array of linked lists (the index is the node's value and the linked list at that index stores all of its neighbors).
ex. graph:
0 connected to 1 and 2
1 connected to 0 and 2
2 connected to 0 and 1
Or, if [a, b, c] is an array containing a, b, and c, and [x -> y -> z] is a linked list containing x, y, and z:
representation: [[1->2], [0->2], [0->1]]
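For concreteness, here is how the example triangle graph might be built in each representation (a sketch; the distance of 1 is arbitrary, and plain Python lists stand in for the linked lists):

# (1) Object-based representation
g = Graph()
for v in (0, 1, 2):
    g.nodes[v] = Node(v)
for a, b in [(0, 1), (0, 2), (1, 2)]:
    g.nodes[a].add(g.nodes[b], 1)
    g.nodes[b].add(g.nodes[a], 1)

# (2) Index-based representation
adj = [[1, 2], [0, 2], [0, 1]]   # adj[i] holds the neighbors of node i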
Question : What are the pros and cons of each representation and which is more widely used?
Note: it's a bit odd that one representation includes distances and the other doesn't. It's pretty easy to change them so that both include distances or both omit them, though, so I'll omit that detail (you might be interested in set() rather than {}).
It looks like both representations are variants of an Adjacency List (explained further in https://stackoverflow.com/a/62684297/3798897). Conceptually there isn't much difference between the two representations -- you have a collection of nodes, and each node has a reference to a collection of neighbors. Your question is really two separate problems:
(1) Should you use a dictionary or an array to hold the collection of nodes?
They're nearly equivalent; a dictionary isn't much more than an array behind the scenes. If you don't have a strong reason to do otherwise, relying on the built-in dictionary rather than re-implementing one with your own hash function and a dense array will probably be the right choice.
A dictionary will use a bit more space
Deletions from a dictionary will be much faster (and so will insertions, if you actually mean an array and not Python's list)
If you have a fast way to generate the numbers 1 to n for each node, that might work better than the hash function a dictionary uses behind the scenes, so you might want to use an array
(2) Should you use a set or a linked list to hold the collection of adjacent nodes?
Almost certainly you want a set. It's at least as good asymptotically as a list for anything you want to do with a collection of neighbors, it's more cache friendly, it has less object overhead, and so on.
As always, your particular problem can sway the choice one way or another. E.g., I mentioned that an array has worse insertion/deletion performance than a dictionary, but if you hardly ever insert/delete then that won't matter, and the slightly reduced memory would start to look attractive.
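A minimal sketch of the combination recommended above — a dict keyed by node, each value a set of neighbors (the names are mine, not from the question):

from collections import defaultdict

class SetGraph:
    def __init__(self):
        # node -> set of adjacent nodes; O(1) average add/lookup/delete
        self.neighbors = defaultdict(set)

    def add_edge(self, u, v):
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)

g = SetGraph()
g.add_edge(0, 1); g.add_edge(0, 2); g.add_edge(1, 2)
print(g.neighbors[0])   # {1, 2}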
This is a bottom-up approach to checking whether the tree is an AVL tree or not. Here is how this code works:
Suppose this is a tree:

        8
       / \
      3   10
     /
    2
   /
  1
The recursion first confirms that the leaf node (here 1) is a leaf. It then unwinds one level, so the node with data 2 becomes the current node. At that point cl = 1 while the right subtree is compared; the right branch of 2 is empty, i.e. it has no children, so avl_compare receives (1, 0), which is allowed.
After this I want to add one to cl, so that when the node with data 3 is the current node, cl = 2. avl_check is an assignment question. I have done this on my own, but I need some help here with how to work with recursive functions.
def avl_check(self):
    cl = cr = 0
    if self.left:
        self.left.avl_check()
        cl += 1
    if self.right:
        self.right.avl_check()
        cr += 1
    if not self.avl_compare(cl, cr):
        print("here")
Your immediate problem is that you don't seem to understand local and global variables. cl and cr are local variables; with the given control flow, the only values they can ever have are 0 and 1. Remember that each instance of the routine gets a new set of local variables: you set them to 0, perhaps increment to 1, and then you return. This does not affect the values of the variables in other instances of the function.
A deeper problem is that you haven't thought this through for larger trees. Assume that you do learn to use global variables and correct these increments. Take your current tree, insert nodes 4, 9, 10, and 11 (nicely balanced). Walk through your algorithm, tracing the values of cl and cr. By the time you get to node 10, cl is disturbingly more than the tree depth -- I think this is a fatal error in your logic.
Think through this again: a recursive routine should not have global variables, except perhaps for the data store of a dynamic programming implementation (which does not apply here). The function should check for the base case and return something trivial (such as 0 or 1). Otherwise, the function should reduce the problem one simple step and recur; when the recursion returns, the function does something simple with the result and returns the new result to its parent.
Your task is relatively simple:
Find the depths of the two subtrees.
If their difference > 1, return False
else return True
You should already know how to check the depth of a tree. Implement this first. After that, make your implementation more intelligent: checking the depth of a subtree should also check its balance at each step. That will be your final solution.
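A minimal sketch of that final step, assuming nodes expose .left and .right and a missing child is None (the function names are mine):

def check_height(node):
    # Return the height of the subtree, or -1 if any subtree is unbalanced.
    if node is None:
        return 0
    left = check_height(node.left)
    if left == -1:
        return -1
    right = check_height(node.right)
    if right == -1:
        return -1
    if abs(left - right) > 1:
        return -1
    return 1 + max(left, right)

def is_avl(root):
    return check_height(root) != -1

Each node is visited once and no global state is needed, so the whole check runs in O(n).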
This is my first time posting here, so I hope I've asked my question in the right sort of way.
After adding an element to a Python dictionary, is it possible to get Python to tell you if adding that element caused a collision? (And how many locations the collision resolution strategy probed before finding a place to put the element?)
My problem is: I am using dictionaries as part of a larger project, and after extensive profiling, I have discovered that the slowest part of the code is dealing with a sparse distance matrix implemented using dictionaries.
The keys I'm using are IDs of Python objects, which are unique integers, so I know they all hash to different values. But putting them in a dictionary could still cause collisions in principle. I don't believe that dictionary collisions are the thing that's slowing my program down, but I want to eliminate them from my enquiries.
So, for example, given the following dictionary:
import random

d = {}
for i in xrange(15000):
    d[random.randint(15000000, 18000000)] = 0
can you get Python to tell you how many collisions happened when creating it?
My actual code is tangled up with the application, but the above code makes a dictionary that looks very similar to the ones I am using.
To repeat: I don't think that collisions are what is slowing down my code, I just want to eliminate the possibility by showing that my dictionaries don't have many collisions.
Thanks for your help.
Edit: Some code to implement #Winston Ewert's solution:
n = 1500
collision_count = 0

class Foo():
    def __eq__(self, other):
        global collision_count
        collision_count += 1
        return id(self) == id(other)

    def __hash__(self):
        #return id(self)  # #John Machin: yes, I know!
        return 1

objects = [Foo() for i in xrange(n)]

d = {}
for o in objects:
    d[o] = 1

print collision_count
Note that when you define __eq__ on a class, Python gives you a TypeError: unhashable instance if you don't also define a __hash__ function.
It doesn't run quite as I expected. If you have the __hash__ function return 1, then you get loads of collisions, as expected (1125560 collisions for n=1500 on my system). But with return id(self), there are 0 collisions.
Anyone know why this is saying 0 collisions?
Edit:
I might have figured this out.
Is it because __eq__ is only called if the __hash__ values of two objects are the same, not their "crunched version" (as #John Machin put it)?
Short answer:
You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.
Collisions do happen. "Having unique thingies means no collisions" is wrong for several values of "thingy".
You shouldn't be worrying about collisions.
Long answer:
Some explanations, derived from reading the source code:
A dict is implemented as a table of 2 ** i entries, where i is an integer.
dicts are no more than 2/3 full. Consequently for 15000 keys, i must be 15 and 2 ** i is 32768.
When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). As the address is likely to have zeroes in the low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer
It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.
Consequently collisions are inevitable.
However this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd etc probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes.
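For reference, the probe sequence described in the comments of Objects/dictobject.c works roughly like this (a simplified sketch, assuming a non-negative hash value):

def probe_slots(h, i, count):
    # Model of CPython's open addressing: yield the first `count` slots
    # examined for a key with hash h in a table of size 2**i.
    mask = (1 << i) - 1
    perturb = h
    j = h & mask              # first probe: the low-order i bits
    for _ in range(count):
        yield j
        j = (5 * j + 1 + perturb) & mask
        perturb >>= 5         # fold higher-order hash bits into later probes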
The code below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).
In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts, e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a separate question.
Update after testing on Python 2.6:
Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Update if you can".
>>> n = 15000
>>> i = 0
>>> while 2 ** i / 1.5 < n:
... i += 1
...
>>> print i, 2 ** i, int(2 ** i / 1.5)
15 32768 21845
>>> probe_mask = 2 ** i - 1
>>> print hex(probe_mask)
0x7fff
>>> class Foo(object):
... pass
...
>>> olist = [Foo() for j in xrange(n)]
>>> hashes = [hash(o) for o in olist]
>>> print len(set(hashes))
15000
>>> probes = [h & probe_mask for h in hashes]
>>> print len(set(probes))
12997
>>>
This idea doesn't actually work, see discussion in the question.
A quick look at the C implementation of python shows that the code for resolving collisions does not calculate or store the number of collisions.
However, it will invoke PyObject_RichCompareBool on the keys to check if they match. This means that __eq__ on the key will be invoked for every collision.
So:
Replace your keys with objects that define __eq__ and increment a counter when it is called. This will be slower because of the overhead involved in jumping into python for the compare. However, it should give you an idea of how many collisions are happening.
Make sure you use different objects as the key, otherwise python will take a shortcut because an object is always equal to itself. Also, make sure the objects hash to the same value as the original keys.