Pros and cons of different implementations of graph adjacency list - python

I have seen multiple representations of adjacency list of a graph and I do not know which one to use.
I am thinking of the following representation of a Node object and Graph object:
class Node(object):
    def __init__(self, val):
        self.val = val
        self.connections_distance = {}
        # key = node : val = distance

    def add(self, neighborNode, distance):
        if neighborNode not in self.connections_distance:
            self.connections_distance[neighborNode] = distance

class Graph(object):
    def __init__(self):
        self.nodes = {}
        # key = node.val : val = node object

    # multiple methods
The second way is to label the nodes 0 to n-1 (where n is the number of nodes). Each node stores its adjacency as an array of linked lists, where the index is the node value and the linked list stores all of its neighbors.
ex. graph:
0 connected to 1 and 2
1 connected to 0 and 2
2 connected to 0 and 1
Or, if [a, b, c] is an array containing a, b, and c, and [x -> y -> z] is a linked list containing x, y, and z:
representation: [[1->2], [0->2], [0->1]]
Question : What are the pros and cons of each representation and which is more widely used?

Note: It's a bit odd that one representation includes distances and the other doesn't. It's easy enough to make them both include distances or both omit them, so I'll ignore that detail (you might be interested in set() rather than {}).
It looks like both representations are variants of an Adjacency List (explained further in https://stackoverflow.com/a/62684297/3798897). Conceptually there isn't much difference between the two representations -- you have a collection of nodes, and each node has a reference to a collection of neighbors. Your question is really two separate problems:
(1) Should you use a dictionary or an array to hold the collection of nodes?
They're nearly equivalent; a dictionary isn't much more than an array behind the scenes. If you don't have a strong reason to do otherwise, relying on the built-in dictionary rather than re-implementing one with your own hash function and a dense array will probably be the right choice.
A dictionary will use a bit more space.
Deletions from a dictionary will be much faster (and so will insertions, if you actually mean an array and not Python's list).
If you have a fast way to generate the numbers 1 to n for each node, that might beat the hash function a dictionary uses behind the scenes, in which case you might want an array.
(2) Should you use a set or a linked list to hold the collection of adjacent nodes?
Almost certainly you want a set. It's at least as good asymptotically as a list for anything you want to do with a collection of neighbors, it's more cache friendly, it has less object overhead, and so on.
As always, your particular problem can sway the choice one way or another. E.g., I mentioned that an array has worse insertion/deletion performance than a dictionary, but if you hardly ever insert/delete then that won't matter, and the slightly reduced memory would start to look attractive.
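For a concrete comparison, here is a minimal sketch of both representations with the distances omitted as discussed above (the class and the name add_edge are my own, not from the question):

```python
# Representation 1: a dictionary of nodes, each holding a set of neighbors.
class Graph:
    def __init__(self):
        self.nodes = {}  # key = node value : val = set of neighboring values

    def add_edge(self, u, v):
        self.nodes.setdefault(u, set()).add(v)
        self.nodes.setdefault(v, set()).add(u)

# Representation 2: nodes labelled 0..n-1, a list indexed by node value.
n = 3
adj = [set() for _ in range(n)]

edges = [(0, 1), (0, 2), (1, 2)]  # the example graph from the question
g = Graph()
for u, v in edges:
    g.add_edge(u, v)
    adj[u].add(v)
    adj[v].add(u)

print(sorted(g.nodes[0]))  # [1, 2]
print(sorted(adj[0]))      # [1, 2]
```

Both answer "who are 0's neighbors?" in O(1) expected time; the list version just trades the hash lookup for direct indexing.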

Related

Data Structure for fast insertion and random access in already sorted data

p = random_point(a, b)
# random_point() returns a tuple/named-tuple (x, y)
# 0 < x < a, 0 < y < b
if centers.validates(p):
    centers.insert(p)
    # centers is the data structure to store points
In the centers data structure all x and y coordinates are stored in two separate sorted(ascending) lists, one for x and other for y. Each node in x points to the corresponding y, and vice versa, so that they can be separately sorted and still hold the pair property: centers.get_x_of(y) and centers.get_y_of(x)
Properties that I require in data structure:
Fast Insertion, in already sorted data (preferably log n)
Random access
Sort x and y separately, without losing pair property
Initially I thought of using simple Lists, and using Binary search to get the index for inserting any new element. But I found, that, it can be improved using self balancing trees like AVL or B-trees. I could make two trees each for x and y, with each node having an additional pointer that could point from x-tree node to y-tree node.
But I don't know how to build random access functionality in these trees. The function centers.validate() tries to insert x & y, and runs some checks with the neighboring elements, which requires random access:
def validate(p):
    indices = get_index(p)
    # returns a named tuple of indices at which to insert x and y, e.g. (3, 7)
    condition1 = func(x_list[indices.x - 1], p.x) and func(x_list[indices.x + 1], p.x)
    condition2 = func(y_list[indices.y - 1], p.y) and func(y_list[indices.y + 1], p.y)
    # func is some mathematical condition on the neighboring elements of x and y
    return condition1 and condition2
In the above function I need to access neighboring elements of x & y
data structure. I think implementing this in trees would complicate it. Are there any combination of data structure that can achieve this? I am writing this in Python(if that can help)
Use a class with two dicts that hold the values, where the keys of each dict are the keys of the other dict containing the related value. The class would also need to maintain an order list per dict (your current sort of that dict's values) so that the elements of each dict can be accessed in order. Insertion would use binary search (or another efficient search) per dict; in practice it would operate on the order list for that dict to find each midpoint key, then check against the value stored under that key.
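A minimal sketch of that idea using the stdlib bisect module (the class and method names are my own; this assumes all x and all y coordinates are unique, and note that insort is O(log n) to locate the slot but O(n) to shift elements, so it only approximates the log-n insertion asked for):

```python
import bisect

class Centers:
    def __init__(self):
        self.xs = []        # sorted x coordinates (the order list for x)
        self.ys = []        # sorted y coordinates (the order list for y)
        self.x_to_y = {}    # pair property: x -> its partner y
        self.y_to_x = {}    # pair property: y -> its partner x

    def insert(self, p):
        x, y = p
        bisect.insort(self.xs, x)  # O(log n) search, O(n) shift
        bisect.insort(self.ys, y)
        self.x_to_y[x] = y
        self.y_to_x[y] = x

    def get_y_of(self, x):
        return self.x_to_y[x]

    def neighbors_x(self, x):
        """Random access to the sorted-order neighbors of an existing x."""
        i = bisect.bisect_left(self.xs, x)
        left = self.xs[i - 1] if i > 0 else None
        right = self.xs[i + 1] if i + 1 < len(self.xs) else None
        return left, right

c = Centers()
for p in [(5, 2), (1, 9), (3, 4)]:
    c.insert(p)
print(c.xs)              # [1, 3, 5]
print(c.get_y_of(3))     # 4
print(c.neighbors_x(3))  # (1, 5)
```

If the O(n) shift inside insort becomes the bottleneck, the order lists are exactly what you would replace with balanced trees.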

representing a tree as a list in python

I'm learning python and I'm curious about how people choose to store (binary) trees in python.
Is there something wrong in storing the nodes of the tree as a list in python? something like:
[0,1,2,3,4,5,6,7,8]
where the 0'th position is 0 by default, 1 is the root, and for each position (i), the 2i and 2i+1 positions are the children. When no child is present, we just have a 'None' in that position.
I've read a couple of books/notes where they represent a tree using a list of lists, or something more complicated than just a simple list like this, and I was wondering if there's something inherently wrong with how I'm looking at it?
You certainly COULD do this. I'd define it as a class deriving from list with a get_children method. However, this is fairly ugly, since either (A) you'd have to preprocess the whole list in O(n) time to pair up indices with values, or (B) you'd have to call list.index, at O(n) per call, giving O(n log n) just to walk a root-to-leaf path.
class WeirdBinaryTreeA(list):
    def get_children(self, value):
        """Calls list.index on value to derive the children"""
        idx = self.index(value)  # O(n) once, O(n log n) to walk a path
        return self[idx * 2], self[idx * 2 + 1]

class WeirdBinaryTreeB(list):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__mapping = self.processtree()

    def processtree(self):
        mapping = {}
        for idx, val in enumerate(self):
            mapping[val] = idx
        return mapping

    def get_children(self, value):
        """Queries the mapping on value to derive the children"""
        idx = self.__mapping[value]  # O(1) once, O(n) to traverse
        return self[idx * 2], self[idx * 2 + 1]
However the bigger question is why would you do this? What makes it better than a list of lists or a dict of dicts? What happens when you have:
A
 \
  B
   \
    C
     \
      D
       \
        E
         \
          F
And your list looks like:
[0, 'A', None, 'B', None, None, None, 'C', None, None, None, None, None, None, None, 'D', ...]
Instead of:
{"A": {"B": {"C": {"D": {"E": {"F": None}}}}}}
There's nothing wrong with storing a binary tree as a list the way you're doing - it's the same idea as storing it as a flat array in a language like C or Java. Accessing the parent of a given node is very fast, and finding the children is also pretty efficient.
I suppose a lot of examples and tutorials will prefer to use a representation that's 'really tree shaped' (list of lists or objects) - it might be a bit more intuitive to explain.
I've seen representations like this (your flat list/array) used in C code, and representations like this could be acceptable in Python too, but it depends on the nature of the data you're handling. In C code, a balanced tree in this list representation can be very fast to access (much quicker than navigating a series of pointers), though the performance benefit in Python may be less noticeable due to all the other overhead.
For reasonably balanced dense trees, this flat list approach is reasonable. However, as Adam Smith commented, this type of flat list tree would become extremely wasteful for unbalanced sparse trees. Suppose you have one branch with single children going down a hundred levels, and the rest of the tree had nothing. You would need 2^100 + 2^99 + 2^98 + ... + 2^1 + 1 spots in the flat list tree. For such a case, you would use up a huge amount of memory for something that could be represented much more efficiently with nested lists.
So in essence, the choice between flat list trees vs. nested list trees is similar to the choice between flat array trees and pointer based trees in a C like language.
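For reference, the parent/child index arithmetic behind the flat layout (root at index 1, children at 2i and 2i+1, as in the question) can be sketched as:

```python
def parent(i):
    return i // 2  # parent of the node at index i

def children(i):
    return 2 * i, 2 * i + 1  # left and right child indices

# Complete tree from the question: index 0 unused, root at index 1.
tree = [0, 1, 2, 3, 4, 5, 6, 7, 8]

print(children(1))            # (2, 3): children of the root
print(parent(7))              # 3
print(tree[children(2)[0]])   # 4: the left child of the node at index 2
```

No pointers are stored anywhere; the shape of the tree lives entirely in the arithmetic, which is why the representation only pays off when the tree is dense.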

python: can I have a sparse matrix representation without (explicitly) using integer indices?

I have a dataset that is essentially a sparse binary matrix that represents relationships between elements of two sets. For example, let the 1st set be people (represented by their names), e.g. something like this:
people = set(['john','jane','mike','joe'])
and the 2nd set be a bunch of binary attributes, e.g.
attrs = set(['likes_coffee','has_curly_hair','has_dark_hair','drives_car','man_u_fan'])
The dataset is represented by a tab-separated data file that assigns some of the attributes to each person, e.g.
john likes_coffee
john drives_car
john has_curly_hair
jane has_curly_hair
jane man_u_fan
...
attrs has about 30,000 elements, people can be as big as 6,000,000, but the data is sparse, i.e. each person has at most 30-40 attributes.
I am looking for a data structure/class in python that would allow me:
To quickly create a matrix object representing the dataset from the corresponding data file
To be able to quickly extract individual elements of the matrix as well as blocks of its rows and columns. For example, I want to answer questions like
"Give me a list of all people with {'has_curly_hair','likes_coffee','man_u_fan'}"
"Give me a union of attributes of {'mike','joe'}"
My current implementation uses a pair of arrays for the two sets and a scipy sparse matrix. So if
people = ['john','jane','mike','joe']
attrs = ['likes_coffee','has_curly_hair','has_dark_hair','drives_car','man_u_fan']
then I would create a sparse matrix data of size 4 X 5 and the sample data shown above would correspond to elements
data[0,0]
data[0,3]
data[0,1]
data[1,1]
data[1,4]
...
I also maintain two inverse indices so that I don't have to invoke people.index('mike') or attrs.index('has_curly_hair') too often
This works OK but I have to maintain the indices explicitly. This is cumbersome, for instance, when I have two datasets with different sets of people and/or attributes and I need to match rows/columns corresponding to the same person/attribute from the two sparse matrices.
So is there an alternative that would allow me to avoid using integer indices and instead use actual elements of the two sets to extract rows/columns, i.e. something like
data['john',:] # give me all attributes of 'john'
data[:,['has_curly_hair','drives_car']] # give me all people who 'has_curly_hair' or 'drives_car'
?
Assuming that no library does exactly what you want, you can create your own class SparseMatrix and overload the [] operator. Here is one way to do it (the constructor might differ from what you want):
class SparseMatrix():
    def __init__(self, x_label, y_label):
        self.data = {}
        for x, y in zip(x_label, y_label):
            self.data[x] = {}
            for attr in y:
                self.data[x][attr] = 1

    def __getitem__(self, index):
        x, y = index
        if type(x) is str:
            if type(y) is str:
                return 1 if y in self.data[x] else 0
            if type(y) is slice:
                return list(self.data[x])
        if type(x) is slice:
            if type(y) is str:
                res = []
                for key in self.data.keys():
                    if y in self.data[key]:
                        res.append(key)
                return res
            if type(y) is list:
                res = []
                for attr in y:
                    res += self.__getitem__((x, attr))
                return res
And in the REPL, I get:
> data = SparseMatrix(['john','jane','mike','joe'],[['likes_coffee','has_curly_hair'],['has_dark_hair'],['drives_car'],['man_u_fan']])
> data['john',:]
['likes_coffee', 'has_curly_hair']
> data[:,['has_curly_hair','drives_car']]
['john', 'mike']
One of the sparse formats is actually a dictionary. A dok_matrix is a dictionary subclass, where the keys are of the form (1,100),(30,334). That is tuples of the i,j indices.
But I found out in other SO questions that element access in this format is actually slower than regular dictionary access. That is, d[1,100] on a dok is slower than the equivalent dd[(1,100)] on a plain dict. I found that it was fastest to build a regular dictionary first and then use update to add the values to the sparse dok.
But dok is useful if you want to transform the matrix to one of the computationally friendly formats like csr. And of course you can access a sparse matrix with d[100,:], something which is impossible with a regular dictionary.
For some uses a default dictionary can be quick and useful. In other words a dictionary where the keys are 'people', and the values are lists or other dictionaries with 'attribute' keys.
Anyways, a sparse matrix does not have provision for word indices. Remember, its roots are in linear algebra: calculating matrix products and inverses of large sparse numeric matrices. Its use for text databases is relatively recent.
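A minimal sketch of the default-dictionary suggestion, pure stdlib (the variable names are my own); keeping an inverse index per attribute makes both of the example queries set operations:

```python
from collections import defaultdict

person_attrs = defaultdict(set)  # person -> set of attributes
attr_people = defaultdict(set)   # attribute -> set of people (inverse index)

rows = [("john", "likes_coffee"), ("john", "drives_car"),
        ("john", "has_curly_hair"), ("jane", "has_curly_hair"),
        ("jane", "man_u_fan")]
for person, attr in rows:  # in practice, read from the tab-separated file
    person_attrs[person].add(attr)
    attr_people[attr].add(person)

# "All people with {'has_curly_hair', 'man_u_fan'}" -> set intersection
print(attr_people["has_curly_hair"] & attr_people["man_u_fan"])  # {'jane'}

# "Union of attributes of {'john', 'jane'}"
print(person_attrs["john"] | person_attrs["jane"])
```

This gives you string keys for free and stays sparse, at the cost of losing the linear-algebra operations a scipy matrix would provide.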

Python Dictionary of Pointers (how to track roots when merging trees)

I am attempting to implement an algorithm (in Python) which involves a growing forest. The number of nodes are fixed, and at each step an edge is added. Throughout the course of the algorithm I need to keep track of the roots of the trees. This is a fairly common problem, e.g. Kruskal's Algorithm. Naively one might compute the root on the fly, but my forest is too large to make this feasable. A second attempt might be to keep a dictionary keyed by the nodes and whose values are the roots of the tree containing the node. This seems more promising, but I need to avoid updating the dictionary value of every node in two trees to be merged (the trees eventually get very deep and this is too computationally expensive). I was hopeful when I found the topic:
Simulating Pointers in Python
The notion was to keep a pointer to the root of each tree and simply update the roots when trees were merged. However, I quickly ran into the following (undesirable) behavior:
class ref:
    def __init__(self, obj): self.obj = obj
    def get(self): return self.obj
    def set(self, obj): self.obj = obj

a = ref(1)
b = ref(2)
c = ref(3)
a = b
b = c
print(a.get(), b.get(), c.get())  # => 2 3 3
Of course the desired output would be 3, 3, 3. If I check the addresses at each step I find that a and b are indeed pointing to the same thing (after a = b), but that a is not updated when I set b = c.
a = ref(1)
b = ref(2)
c = ref(3)
print(id(a),id(b),id(c)) # => 140512500114712 140512500114768 140512500114824
a = b
print(id(a),id(b),id(c)) # => 140512500114768 140512500114768 140512500114824
b = c
print(id(a),id(b),id(c)) # => 140512500114768 140512500114824 140512500114824
My primary concern is to be able to track the roots of trees when they are merged without a costly update; I would take any reasonable solution on this front, whether or not it relates to the ref class. My secondary concern is to understand why Python behaves this way with the ref class and how to modify the class to get the desired behavior. Any help or insight with regard to these problems is greatly appreciated.
When a = b is executed, Python simply rebinds the name a to the object that b currently refers to; get() is never called and nothing is copied. The later b = c then rebinds only the name b, so a still refers to the old object, which is why you see 2, 3, 3.
If you used a.set(c.get()) instead of rebinding the names, the shared object would be mutated in place and every name bound to it would see the change.
Let me know if this works and if anything needs more clarification.
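For the primary concern, the standard structure for merging trees while tracking roots cheaply is a disjoint-set (union-find) with path compression and union by rank, which is also what Kruskal's algorithm uses; this is not from the answer above, just a minimal sketch:

```python
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))  # each node starts as its own root
        self.rank = [0] * n

    def find(self, x):
        # Path halving (a form of path compression): point nodes at their
        # grandparent as we walk up, flattening the tree for later queries.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx  # attach the shallower tree under the deeper one
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

ds = DisjointSet(5)
ds.union(0, 1)
ds.union(2, 3)
ds.union(1, 3)
print(ds.find(0) == ds.find(2))  # True: 0 and 2 now share a root
print(ds.find(4) == ds.find(0))  # False: 4 is still its own root
```

Merging is O(1) plus two finds, and with both optimizations the amortized cost per operation is nearly constant, so no tree-wide dictionary update is ever needed.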

Sort nodes based on inputs / outputs

I have a node system where every node only stores its inputs and outputs, but not its index. Here is a simplified example:
class Node1:
    requiredInputs = []

class Node2:
    requiredInputs = ["Node1"]

class Node3:
    requiredInputs = ["Node2"]

class Node4:
    requiredInputs = ["Node3", "Node2"]
Now I want to order the nodes so that all of a node's inputs are already processed by the time the node itself is processed. For this simple example, a possible order would be [Node1, Node2, Node3, Node4].
My first idea was to brute-force every possible combination. However, this would be very slow for a larger number of nodes.
What would be a more efficient way to do this? I don't need an implementation, just a basic idea or algorithm.
What you want is to topologically sort the nodes.
http://en.wikipedia.org/wiki/Topological_sorting
The very basic idea: assign to each node an integer that starts out equal to the number of inputs it has. Add every node whose value is 0 (that is, the nodes with no inputs) to the list that will represent the order. Each time a node is appended to the list, subtract one from the value of every node that takes it as an input; any node whose value reaches zero is added to the list as well. Repeat. As long as there are no cycles the process is guaranteed to terminate, and the nodes in the list will be sorted so that inputs always come before outputs.
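That counting scheme is Kahn's algorithm; a minimal sketch over a plain mapping of node name to required inputs (the function and variable names are my own):

```python
from collections import deque

def topo_sort(required_inputs):
    """required_inputs: node -> list of nodes that must come before it."""
    pending = {node: len(inputs) for node, inputs in required_inputs.items()}
    # Inverse map: node -> the nodes that consume it as an input.
    consumers = {node: [] for node in required_inputs}
    for node, inputs in required_inputs.items():
        for inp in inputs:
            consumers[inp].append(node)

    ready = deque(n for n, count in pending.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for consumer in consumers[node]:
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    if len(order) != len(required_inputs):
        raise ValueError("cycle detected")
    return order

graph = {"Node1": [], "Node2": ["Node1"],
         "Node3": ["Node2"], "Node4": ["Node3", "Node2"]}
print(topo_sort(graph))  # ['Node1', 'Node2', 'Node3', 'Node4']
```

The whole sort is O(V + E), since each node enters the queue once and each edge is decremented once.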
Algorithm
Topological sort is indeed the way to go; Per your request, I will not write the full implementation.
Outline & notes
Types
First, you could store the requiredInputs as classes, not as strings. This will make the comparisons far more elegant:
class Node1:
    requiredInputs = []

class Node2:
    requiredInputs = [Node1]

class Node3:
    requiredInputs = [Node2]

class Node4:
    requiredInputs = [Node3, Node2]
Input and output data structures
Then, you can place your nodes in two arrays, for input and output. This can be done in-place (using a single array), but it's rarely worth the trouble.
unordered_nodes = [Node4, Node3, Node2, Node1]
ordered_nodes = []
Here's the algorithm outline:
while there are unordered_nodes:
    for each node N in unordered_nodes:
        if the requiredInputs of N are already in ordered_nodes:
            add N to ordered_nodes
            remove N from unordered_nodes
            break
Expected result
When implemented, it should give:
print ordered_nodes
[<class __main__.Node1 at 0x10a7a8bb0>,
<class __main__.Node2 at 0x10a7a83f8>,
<class __main__.Node3 at 0x10a7a80b8>,
<class __main__.Node4 at 0x10a7a8600>]
Optimizations
There are quite a few ways to optimize or otherwise improve a topological sort. As before, I'll hint at a few without disclosing any implementation:
Pre-sorting the input array by some property
Sorting in-place, with a single array
Using a different data structure to represent the relations between nodes
Adding more than one node to ordered_nodes at any iteration
