I have a set of points and need to select an optimal subset of 3 of them, where the criterion is a linear sum of some properties of the points, and some properties of pairs of the points.
In Python, this is quite easy using itertools.combinations:
from itertools import combinations

all_points = list(combinations(points, 3))
costs = []
for i, (p1, p2, p3) in enumerate(all_points):
    costs.append((p1.weight + p2.weight + p3.weight
                  + pair_weight(p1, p2) + pair_weight(p1, p3) + pair_weight(p2, p3),
                  i))
costs.sort()
best = all_points[costs[0][1]]
The problem is that this is a brute-force solution: it enumerates all possible combinations of 3 points, which is O(n^3) in the number of points and therefore quickly leads to a very large number of evaluations. I have been trying to research whether there is a more efficient way to do this, perhaps taking advantage of the linearity of the cost function.
I have tried turning this into a networkx graph featuring node and edge weights. However, I have not yet found an algorithm in that toolkit that can calculate the "shortest triangle", particularly one that considers both edge and node weights. (Shortest path algorithms tend to only consider edge weights for example.)
There are functions to enumerate all cliques, and then I can select 3-cliques, and calculate the cost, but this is also brute force and therefore not better than doing it with combinations as above.
Are there any other algorithms I can look at?
By the way, if I do not have the edge weights, it is easy to just sort the nodes by their node-weight and choose the first three. So it is really the paired costs that add complexity to this problem. I am wondering if somehow I can just list all pairs and find the top-k of those that form triangles, or something better? At least if I could efficiently enumerate top candidates and stop the enumeration on some heuristic, it might be better than the brute force approach.
From now on, I will use n as the number of nodes and m as the number of edges. If your graph is fully connected, then m is just n choose 2. I'll also disregard node weights, because as the comments to your initial post have noted, the node weights can be absorbed into the edges they're connected to.
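To make that absorption concrete, here is a small sketch reusing the question's pair_weight and .weight (the helper name is my own): in a triangle, each node is an endpoint of exactly two of the three edges, so folding half of each node's weight into each incident edge reproduces the original cost.

def combined_weight(u, v):
    # half of each endpoint's node weight is absorbed into the edge
    return pair_weight(u, v) + (u.weight + v.weight) / 2

# For any triangle (p1, p2, p3):
# combined_weight(p1, p2) + combined_weight(p1, p3) + combined_weight(p2, p3)
# == p1.weight + p2.weight + p3.weight + all three pair weights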
Your algorithm is O(n^3); it's hopefully not too hard to see why: you iterate over every possible triplet of nodes. However, it is possible to iterate over every triangle in a graph in O(m sqrt(m)):
for every node u:
    for every node v adjacent to u:
        if degree(u) < degree(v): continue;
        for every node w adjacent to v:
            if degree(v) < degree(w): continue;
            if u is not connected to w: continue;
            // <u,v,w> is a triangle!
The proof for this algorithm's runtime of O(m sqrt(m)) is nontrivial, so I'll direct you here: https://cs.stanford.edu/~rishig/courses/ref/l1.pdf
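Here is a sketch of that enumeration in plain Python (my own translation; it assumes the graph is a dict mapping each node to a set of its neighbours, and it orients edges by ascending (degree, id) rank -- the mirror image of the degree checks above, which works the same way -- so each triangle is reported exactly once):

def triangles(adj):
    # Rank nodes by (degree, id); orient every edge from lower to higher rank.
    order = sorted(adj, key=lambda u: (len(adj[u]), str(u)))
    rank = {u: i for i, u in enumerate(order)}
    for u in adj:
        for v in adj[u]:
            if rank[v] < rank[u]:
                continue                 # only walk edges "forward"
            for w in adj[v]:
                if rank[w] < rank[v]:
                    continue
                if w in adj[u]:
                    yield (u, v, w)      # <u, v, w> is a triangle!

For example, triangles({'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}) yields exactly one triple.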
If your graph is fully connected, then you've gotta stick with the O(n^3), I think. There might be some early-pruning ideas you can do but they won't lead to a significant speedup, probably 2x at very best.
I have a large graph with about 50,000 nodes and 2,000,000 edges. I need to find all pairs of nodes that are between 2 and 3 hops away from each other.
The simplest (and dirty) solution is first to create the combinatorial expansion and then check each pair of nodes:
import networkx as nx
import itertools

g = nx.erdos_renyi_graph(n=5000, p=0.05)
L = list(g.nodes())

# Create a complete graph on the same nodes
G = nx.Graph()
G.add_nodes_from(L)
G.add_edges_from(itertools.combinations(L, 2))
However, when I try to create a complete graph with 45,000 nodes and 2 million edges, I run out of RAM.
Is there any other solution that could inspect a large graph in a reasonable time? Thanks for any advice or pointer.
Take your edge list and convert it to an adjacency matrix; see answers here for how to do it memory-efficiently with scipy.sparse. Then raise the adjacency matrix to the 2nd power (for a scipy.sparse matrix use A @ A; numpy.linalg.matrix_power applies to dense numpy arrays). For each entry in the squared matrix, if the entry is non-zero, there exists a walk of length two between the nodes (in fact the entry gives you the number of walks of length 2 between the nodes). See answers here:
Powers of the adjacency matrix are concatenating walks. The ijth entry of the kth power of the adjacency matrix tells you the number of walks of length k from vertex i to vertex j.
You can do the same for paths of length 3, by raising it to the 3rd power.
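A minimal sketch of the whole pipeline with networkx and scipy.sparse (the example graph and the final filtering are my own; note that non-zero entries count walks, so depending on your definition of "between 2 and 3 hops" you may want to exclude the diagonal and directly connected pairs):

import networkx as nx
import scipy.sparse as sp

g = nx.erdos_renyi_graph(n=5000, p=0.05)
A = nx.to_scipy_sparse_array(g, format="csr")  # to_scipy_sparse_matrix in older networkx

A2 = A @ A        # entry (i, j): number of walks of length 2 from i to j
A3 = A2 @ A       # entry (i, j): number of walks of length 3 from i to j

# Non-zero entries of A2 + A3 mark pairs joined by a walk of length 2 or 3.
# Depending on your definition, drop the diagonal and pairs that are also
# directly connected (non-zero in A).
rows, cols, _ = sp.find(A2 + A3)
pairs = {(i, j) for i, j in zip(rows, cols) if i < j}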
Can someone help explain how can building a heap be O(n) complexity?
Inserting an item into a heap is O(log n), and the insert is repeated n/2 times (the remainder are leaves, and can't violate the heap property). So, this means the complexity should be O(n log n), I would think.
In other words, for each item we "heapify", it has the potential to have to filter down (i.e., sift down) once for each level for the heap so far (which is log n levels).
What am I missing?
I think there are several questions buried in this topic:
How do you implement buildHeap so it runs in O(n) time?
How do you show that buildHeap runs in O(n) time when implemented correctly?
Why doesn't that same logic work to make heap sort run in O(n) time rather than O(n log n)?
How do you implement buildHeap so it runs in O(n) time?
Often, answers to these questions focus on the difference between siftUp and siftDown. Making the correct choice between siftUp and siftDown is critical to get O(n) performance for buildHeap, but does nothing to help one understand the difference between buildHeap and heapSort in general. Indeed, proper implementations of both buildHeap and heapSort will only use siftDown. The siftUp operation is only needed to perform inserts into an existing heap, so it would be used to implement a priority queue using a binary heap, for example.
I've written this to describe how a max heap works. This is the type of heap typically used for heap sort or for a priority queue where higher values indicate higher priority. A min heap is also useful; for example, when retrieving items with integer keys in ascending order or strings in alphabetical order. The principles are exactly the same; simply switch the sort order.
The heap property specifies that each node in a binary heap must be at least as large as both of its children. In particular, this implies that the largest item in the heap is at the root. Sifting down and sifting up are essentially the same operation in opposite directions: move an offending node until it satisfies the heap property:
siftDown swaps a node that is too small with its largest child (thereby moving it down) until it is at least as large as both nodes below it.
siftUp swaps a node that is too large with its parent (thereby moving it up) until it is no larger than the node above it.
The number of operations required for siftDown and siftUp is proportional to the distance the node may have to move. For siftDown, it is the distance to the bottom of the tree, so siftDown is expensive for nodes at the top of the tree. With siftUp, the work is proportional to the distance to the top of the tree, so siftUp is expensive for nodes at the bottom of the tree. Although both operations are O(log n) in the worst case, in a heap, only one node is at the top whereas half the nodes lie in the bottom layer. So it shouldn't be too surprising that if we have to apply an operation to every node, we would prefer siftDown over siftUp.
The buildHeap function takes an array of unsorted items and moves them until they all satisfy the heap property, thereby producing a valid heap. There are two approaches one might take for buildHeap using the siftUp and siftDown operations we've described.
Start at the top of the heap (the beginning of the array) and call siftUp on each item. At each step, the previously sifted items (the items before the current item in the array) form a valid heap, and sifting the next item up places it into a valid position in the heap. After sifting up each node, all items satisfy the heap property.
Or, go in the opposite direction: start at the end of the array and move backwards towards the front. At each iteration, you sift an item down until it is in the correct location.
Which implementation for buildHeap is more efficient?
Both of these solutions will produce a valid heap. Unsurprisingly, the more efficient one is the second operation that uses siftDown.
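To make the comparison concrete, here is a minimal array-based sketch of both approaches (my own code, not from any particular textbook; 0-indexed max heap):

def sift_down(a, i, n):
    """Swap a[i] with its largest child until it is >= both children."""
    while 2 * i + 1 < n:
        c = 2 * i + 1
        if c + 1 < n and a[c + 1] > a[c]:
            c += 1                      # pick the larger child
        if a[i] >= a[c]:
            break
        a[i], a[c] = a[c], a[i]
        i = c

def sift_up(a, i):
    """Swap a[i] with its parent until it is no larger than its parent."""
    while i > 0 and a[i] > a[(i - 1) // 2]:
        a[i], a[(i - 1) // 2] = a[(i - 1) // 2], a[i]
        i = (i - 1) // 2

def build_heap_siftdown(a):             # the efficient O(n) version
    for i in range(len(a) // 2 - 1, -1, -1):
        sift_down(a, i, len(a))

def build_heap_siftup(a):               # the O(n log n) version
    for i in range(len(a)):
        sift_up(a, i)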
Let h = log n represent the height of the heap. The work required for the siftDown approach is given by the sum
(0 * n/2) + (1 * n/4) + (2 * n/8) + ... + (h * 1).
Each term in the sum has the maximum distance a node at the given height will have to move (zero for the bottom layer, h for the root) multiplied by the number of nodes at that height. In contrast, the sum for calling siftUp on each node is
(h * n/2) + ((h-1) * n/4) + ((h-2)*n/8) + ... + (0 * 1).
It should be clear that the second sum is larger. The first term alone is hn/2 = 1/2 n log n, so this approach has complexity at best O(n log n).
How do we prove the sum for the siftDown approach is indeed O(n)?
One method (there are other analyses that also work) is to turn the finite sum into an infinite series and then use Taylor series. We may ignore the first term, which is zero:

(1 * n/4) + (2 * n/8) + ... + (h * 1)
    <= n * (1/4 + 2/8 + 3/16 + ...)                  [the finite sum is bounded by the infinite series]
    = (n/2) * (x + 2x^2 + 3x^3 + ...) at x = 1/2     [a power series evaluated at x = 1/2]
    = (n/2) * x/(1-x)^2 at x = 1/2                   [x times the derivative of 1/(1-x)]
    = (n/2) * 2
    = n
If you aren't sure why each of those steps works, here is a justification for the process in words:
The terms are all positive, so the finite sum must be smaller than the infinite sum.
The series is equal to a power series evaluated at x=1/2.
That power series is equal to (a constant times) the derivative of the Taylor series for f(x)=1/(1-x).
x=1/2 is within the interval of convergence of that Taylor series.
Therefore, we can replace the Taylor series with 1/(1-x), differentiate, and evaluate to find the value of the infinite series.
Since the infinite sum is exactly n, we conclude that the finite sum is no larger, and is therefore, O(n).
Why does heap sort require O(n log n) time?
If it is possible to run buildHeap in linear time, why does heap sort require O(n log n) time? Well, heap sort consists of two stages. First, we call buildHeap on the array, which requires O(n) time if implemented optimally. The next stage is to repeatedly delete the largest item in the heap and put it at the end of the array. Because we delete an item from the heap, there is always an open spot just after the end of the heap where we can store the item. So heap sort achieves a sorted order by successively removing the next largest item and putting it into the array starting at the last position and moving towards the front. It is the complexity of this last part that dominates in heap sort. The loop looks like this:
for (i = n - 1; i > 0; i--) {
    arr[i] = deleteMax();
}
Clearly, the loop runs O(n) times (n - 1 to be precise, the last item is already in place). The complexity of deleteMax for a heap is O(log n). It is typically implemented by removing the root (the largest item left in the heap) and replacing it with the last item in the heap, which is a leaf, and therefore one of the smallest items. This new root will almost certainly violate the heap property, so you have to call siftDown until you move it back into an acceptable position. This also has the effect of moving the next largest item up to the root. Notice that, in contrast to buildHeap where for most of the nodes we are calling siftDown from the bottom of the tree, we are now calling siftDown from the top of the tree on each iteration! Although the tree is shrinking, it doesn't shrink fast enough: The height of the tree stays constant until you have removed the first half of the nodes (when you clear out the bottom layer completely). Then for the next quarter, the height is h - 1. So the total work for this second stage is
h*n/2 + (h-1)*n/4 + ... + 0 * 1.
Notice the switch: now the zero work case corresponds to a single node and the h work case corresponds to half the nodes. This sum is O(n log n) just like the inefficient version of buildHeap that is implemented using siftUp. But in this case, we have no choice since we are trying to sort and we require the next largest item be removed next.
In summary, the work for heap sort is the sum of the two stages: O(n) time for buildHeap and O(n log n) to remove each node in order, so the complexity is O(n log n). You can prove (using some ideas from information theory) that for a comparison-based sort, O(n log n) is the best you could hope for anyway, so there's no reason to be disappointed by this or expect heap sort to achieve the O(n) time bound that buildHeap does.
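For reference, a compact self-contained sketch of the full algorithm (my own code, not from a specific source; max heap, ascending sort):

def heap_sort(a):
    def sift_down(i, n):
        # move a[i] down until it is >= both of its children within a[:n]
        while 2 * i + 1 < n:
            c = 2 * i + 1
            if c + 1 < n and a[c + 1] > a[c]:
                c += 1
            if a[i] >= a[c]:
                break
            a[i], a[c] = a[c], a[i]
            i = c

    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # stage 1: buildHeap, O(n)
        sift_down(i, n)
    for end in range(n - 1, 0, -1):       # stage 2: n deleteMax calls, O(log n) each
        a[0], a[end] = a[end], a[0]       # move the current max just past the heap
        sift_down(0, end)
    return a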
Your analysis is correct. However, it is not tight.
It is not really easy to explain why building a heap is a linear operation; you had better read the analysis.
A great analysis of the algorithm can be seen here.
The main idea is that in the build_heap algorithm the actual heapify cost is not O(log n) for all elements.
When heapify is called, the running time depends on how far an element might move down in the tree before the process terminates. In other words, it depends on the height of the element in the heap. In the worst case, the element might go down all the way to the leaf level.
Let us count the work done level by level.
At the bottommost level, there are 2^h nodes, but we do not call heapify on any of these, so the work is 0. At the next level there are 2^(h-1) nodes, and each might move down by 1 level. At the 3rd level from the bottom, there are 2^(h-2) nodes, and each might move down by 2 levels.
As you can see, not all heapify operations are O(log n); this is why you end up with O(n).
Intuitively:
"The complexity should be O(nLog n)... for each item we "heapify", it has the potential to have to filter down once for each level for the heap so far (which is log n levels)."
Not quite. Your logic does not produce a tight bound -- it overestimates the complexity of each heapify. If built from the bottom up, insertion (heapify) can be much less than O(log(n)). The process is as follows:
( Step 1 ) The first n/2 elements go on the bottom row of the heap. h=0, so heapify is not needed.
( Step 2 ) The next n/2^2 elements go on row 1 up from the bottom. h=1, heapify filters 1 level down.
( Step i ) The next n/2^i elements go in row i up from the bottom. h=i, heapify filters i levels down.
( Step log(n) ) The last n/2^log2(n) = 1 element goes in row log(n) up from the bottom. h=log(n), heapify filters log(n) levels down.
NOTICE that after step one, half of the elements (n/2) are already in the heap, and we didn't even need to call heapify once. Also, notice that only a single element, the root, actually incurs the full log(n) complexity.
Theoretically:

The total number of steps N to build a heap of size n can be written out mathematically.

At height i, we've shown (above) that there will be n/2^(i+1) elements that need to call heapify, and we know heapify at height i is O(i). This gives:

N = O( sum over i = 0 to log(n) of i * n/2^(i+1) ) = O( (n/2) * sum over i of i/2^i )

The solution to the last summation can be found by taking the derivative of both sides of the well-known geometric series equation:

sum over i of x^i = 1/(1-x)  =>  sum over i of i*x^(i-1) = 1/(1-x)^2  =>  sum over i of i*x^i = x/(1-x)^2

Finally, plugging x = 1/2 into the above equation yields sum over i of i/2^i = 2. Plugging this into the first equation gives:

N = O( (n/2) * 2 ) = O(n)

Thus, the total number of steps is of size O(n).
There are already some great answers, but I would like to add a little visual explanation.
Now, take a look at the image (a heap with n = 23 nodes); there are

n/2^1 green nodes with height 0 (here 23/2 = 12)
n/2^2 red nodes with height 1 (here 23/4 = 6)
n/2^3 blue nodes with height 2 (here 23/8 = 3)
n/2^4 purple nodes with height 3 (here 23/16 = 2)

so there are n/2^(h+1) nodes at height h.
To find the time complexity, let's count the amount of work done, i.e., the maximum number of iterations performed by each node. Notice that each node can perform at most a number of iterations equal to its height:
Green = n/2^1 * 0 (no iterations since there are no children)
Red = n/2^2 * 1 (heapify will perform at most one swap for each red node)
Blue = n/2^3 * 2 (heapify will perform at most two swaps for each blue node)
Purple = n/2^4 * 3 (heapify will perform at most three swaps for each purple node)
so for nodes of height h, the maximum work done is n/2^(h+1) * h.
Now total work done is
->(n/2^1 * 0) + (n/2^2 * 1)+ (n/2^3 * 2) + (n/2^4 * 3) +...+ (n/2^(h+1) * h)
-> n * ( 0 + 1/4 + 2/8 + 3/16 +...+ h/2^(h+1) )
now for any value of h, the sequence
-> ( 0 + 1/4 + 2/8 + 3/16 +...+ h/2^(h+1) )
will never exceed 1
Thus the time complexity will never exceed O(n) for building heap
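A quick numeric check of that claim (my own snippet, not part of the original answer):

# The bracketed series 0 + 1/4 + 2/8 + 3/16 + ... + h/2^(h+1) approaches 1
# from below as h grows, so the total work stays below n * 1.
for h in (5, 10, 20, 50):
    s = sum(i / 2 ** (i + 1) for i in range(h + 1))
    print(h, s)   # 0.8906..., 0.9941..., 0.99998..., ~1.0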
It would be O(n log n) if you built the heap by repeatedly inserting elements. However, you can create a new heap more efficiently by inserting the elements in arbitrary order and then applying an algorithm to "heapify" them into the proper order (depending on the type of heap of course).
See http://en.wikipedia.org/wiki/Binary_heap, "Building a heap" for an example. In this case you essentially work up from the bottom level of the tree, swapping parent and child nodes until the heap conditions are satisfied.
As we know, the height of a heap is log(n), where n is the total number of elements. Let us represent it as h.

When we perform the heapify operation, the elements at the last level (h) won't move even a single step.

The number of elements at the second-to-last level (h-1) is 2^(h-1), and they can move at most 1 level (during heapify). Similarly, at the ith level, we have 2^i elements which can move h-i positions.

Therefore the total number of moves:

S = 2^h * 0 + 2^(h-1) * 1 + 2^(h-2) * 2 + ... + 2^0 * h

S = 2^h (1/2 + 2/2^2 + 3/2^3 + ... + h/2^h) -------------------------------------------------(1)

This is an AGP series; to solve it, divide both sides by 2:

S/2 = 2^h (1/2^2 + 2/2^3 + ... + h/2^(h+1)) -------------------------------------------------(2)

Subtracting equation (2) from (1) gives

S/2 = 2^h (1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h - h/2^(h+1))

S = 2^(h+1) (1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h - h/2^(h+1))

Now 1/2 + 1/2^2 + 1/2^3 + ... + 1/2^h is a decreasing GP whose sum is less than 1 (as h tends to infinity, the sum tends to 1). Taking 1 as an upper bound on that sum gives:

S <= 2^(h+1) * 1 = 2 * 2^h

As h = log(n), 2^h = n.

Therefore S <= 2n, and T(n) = O(n).
While building a heap, let's say you're taking a bottom-up approach.

You take each element and compare it with its children to check if the pair conforms to the heap rules. So the leaves get included in the heap for free, because they have no children.

Moving upwards, the worst-case scenario for the nodes right above the leaves would be 1 comparison (at most they would be compared with just one generation of children).

Moving further up, their immediate parents can at most be compared with two generations of children.

Continuing in the same direction, you'll have log(n) comparisons for the root in the worst-case scenario, and log(n)-1 for its immediate children, log(n)-2 for their immediate children, and so on.

So summing it all up, you arrive at something like log(n) + {log(n)-1}*2 + {log(n)-2}*4 + ... + 1*2^{log(n)-1}, which is nothing but O(n).
We get the runtime for the heap build by figuring out the maximum number of moves each node can make. So we need to know how many nodes are in each row and how far each node can move from there.

Starting from the root node, each row has double the nodes of the previous row, so by asking how often we can double the number of nodes until there are none left, we get the height of the tree. Or, in mathematical terms, the height of the tree is log2(n), n being the length of the array.

To calculate the nodes in one row we start from the back: we know n/2 nodes are at the bottom, so by dividing by 2 we get the previous row, and so on.

Based on this, we get the following formula for the sift-down approach:

(0 * n/2) + (1 * n/4) + (2 * n/8) + ... + (log2(n) * 1)

The term in the last parenthesis is the height of the tree multiplied by the one node at the root; the term in the first parenthesis is all the nodes in the bottom row multiplied by the distance they can travel, 0.
The same formula in summation notation:

sum over i = 0 to log2(n) of (i * n/2^(i+1)) <= n * (sum over i = 0 to infinity of i/2^i) = n * 2

The infinite sum evaluates to 2; bringing the n back in, we have 2 * n. The 2 can be discarded because it's a constant, and tada, we have the worst-case runtime of the sift-down approach: O(n).
Short Answer
Building a binary heap will take O(n) time with Heapify().
When we add the elements in a heap one by one and keep satisfying the heap property (max heap or min heap) at every step, then the total time complexity will be O(nlogn).
This is because the general structure of a binary heap is a complete binary tree, so the height of the heap is h = O(logn). The insertion time of an element into the heap is therefore equivalent to the height of the tree, i.e. O(h) = O(logn). For n elements this takes O(nlogn) time.
Consider another approach now.
I assume that we have a min heap for simplicity.
So every node should be smaller than its children.
Add all the elements in the skeleton of a complete binary tree. This will take O(n) time.
Now we just have to somehow satisfy the min-heap property.
Since all the leaf elements have no children, they already satisfy the heap property. The total number of leaf elements is ceil(n/2), where n is the total number of elements present in the tree.

Now, for every internal node, if it is greater than its children, swap it with its minimum child and keep sifting it down until the subtree rooted at that node is a proper min heap, processing the nodes bottom to top. Note: we do not sift values up to the root like we do in insertion; each value only moves down within its own subtree.

In the array-based implementation of a binary heap, we have parent(i) = floor((i-1)/2), and the children of i are given by 2*i + 1 and 2*i + 2. So by observation we can say that the last ceil(n/2) elements in the array are leaf nodes: the greater the depth, the greater the index of a node. We will repeat Step 4 for array[n/2], array[n/2 - 1], ..., array[0]. In this way we ensure that we work in a bottom-to-top fashion, and overall we eventually establish the min-heap property.

Step 4 over all the n/2 internal nodes takes O(n) time in total: a node at height h sifts down at most h levels, and the sum of heights over the whole tree is O(n), as shown in the other answers.
So our total time complexity of heapify using this approach will be O(n) + O(n) ~ O(n).
In the case of building the heap, we start from height logn - 1 (where logn is the height of a tree of n elements). For each element present at height h, heapify can move it down at most h levels.

So the total number of traversals would be:

T(n) = sum over h = 1 to logn of (2^(logn - h)) * h
T(n) = n * ((1/2) + (2/4) + (3/8) + ... + (logn / 2^logn))
T(n) = n * (sum over x = 1 to logn of x/2^x)

and the sum in the brackets approaches 2 as the number of terms goes to infinity.

Hence T(n) ~ O(n).
Successive insertions can be described by:

T = O(log(1) + log(2) + ... + log(n)) = O(log(n!))

By Stirling's approximation, n! =~ O(n^(n + O(1))), therefore T =~ O(n log(n)).

Hope this helps; the optimal O(n) way is to use the build-heap algorithm for a given set (ordering doesn't matter).
Basically, work is done only on non-leaf nodes while building a heap, and the work done is the amount of swapping down needed to satisfy the heap condition. In other words (in the worst case), the amount is proportional to the height of the node. All in all, the complexity of the problem is proportional to the sum of the heights of all the non-leaf nodes, which is (2^(h+1) - 1) - h - 1 = n - h - 1 = O(n).
@bcorso has already demonstrated the proof of the complexity analysis. But for the sake of those still learning complexity analysis, I have this to add:
The basis of your original mistake is due to a misinterpretation of the meaning of the statement, "insertion into a heap takes O(log n) time". Insertion into a heap is indeed O(log n), but you have to recognise that n is the size of the heap during the insertion.
In the context of inserting n objects into a heap, the complexity of the ith insertion is O(log n_i) where n_i is the size of the heap as at insertion i. Only the last insertion has a complexity of O (log n).
Let's suppose you have N elements in a heap. Then its height would be Log(N).

Now you want to insert another element; the complexity would be Log(N): we may have to compare all the way up to the root.

Now you have N+1 elements and height Log(N+1).

Using induction, it can be proved that the total complexity of the insertions is ∑ log(i).

Now, using

log a + log b = log ab

this simplifies to ∑ log(i) = log(n!), which is actually O(N log N).

But we are doing something wrong here, since in most cases we do not reach the top. Hence, while executing, most of the time we find that we do not go even halfway up the tree. Consequently, this bound can be tightened into another, tighter bound by using the mathematics given in the answers above.

This realization came to me after detailed thought and experimentation on heaps.
I really like the explanation by Jeremy West... another approach which is really easy to understand is given here: http://courses.washington.edu/css343/zander/NotesProbs/heapcomplexity

buildheap uses heapify, and the sift-down approach is used, whose cost depends on the sum of the heights of all nodes. So we want the sum of the heights of the nodes, which is given by

S = sum over i = 0 to h of 2^i * (h - i), where h = logn is the height of the tree.

Solving for S, we get S = 2^(h+1) - 1 - (h+1).

Since n = 2^(h+1) - 1,

S = n - h - 1 = n - logn - 1

so S = O(n), and the complexity of buildheap is O(n).
"The linear time bound of build Heap, can be shown by computing the sum of the heights of all the nodes in the heap, which is the maximum number of dashed lines.
For the perfect binary tree of height h containing N = 2^(h+1) – 1 nodes, the sum of the heights of the nodes is N – H – 1.
Thus it is O(N)."
Proof of O(n)
The proof isn't fancy, and is quite straightforward; I only proved the case for a full binary tree, but the result can be generalized to a complete binary tree.
We can use another optimal solution to build a heap instead of inserting each element repeatedly. It goes as follows:
1) Arbitrarily put the n elements into the array to respect the shape property of the heap.

2) Starting from the lowest level and moving upwards, sift the root of each subtree downward as in the heapify-down process until the heap property is restored.
This process can be illustrated with the following image:
Next, let's analyze the time complexity of the above process. Suppose there are n elements in the heap, and the height of the heap is h (for the heap in the above image, the height is 3). Then we have the following relationship: when there is only one node in the last level, n = 2^h; when the last level of the tree is completely filled, n = 2^(h+1) - 1. In both cases, 2^h <= n.

Starting from the bottom as level 0 (the root node is level h), at level j there are at most 2^(h-j) nodes, and each node takes at most j swap operations. So at level j, the total number of operations is j * 2^(h-j).
So the total running time for building the heap is proportional to:

sum over j = 0 to h of j * 2^(h-j)

If we factor out the 2^h term, we get:

2^h * sum over j = 0 to h of j/2^j

As we know, ∑ j/2^j is a series that converges to 2 (in detail, you can refer to this wiki). Using this we have:

2^h * sum over j of j/2^j <= 2^h * 2 = 2^(h+1)

Based on the condition 2^h <= n, we have:

running time <= 2^(h+1) <= 2n = O(n)

This proves that building a heap is a linear operation.
I have the basic linear shortest path algorithm implemented in Python. According to various sites I've come across, this only works for directed acyclic graphs, including this, this, and this. However, I don't see why this is the case.
I've even tested the algorithm against graphs with cycles and undirected edges, and it worked fine.
So the question is, why doesn't the linear shortest path algorithm work for non-directed cyclic graphs? Side question, what is the name of this algorithm?
For reference, here is the code I wrote for the algorithm:
def shortestPath(start, end, graph):
    # First, topologically sort the graph to determine the order to traverse it in
    order = toplogicalSort(start, graph)
    # Store the current weight of each node's path, and its predecessor
    # (note: this assumes the start node has index 0)
    weights = [0] + [float('inf')] * (len(graph) - 1)
    predecessor = [0] * len(graph)
    # Next, relax all edges in the order of the sorted nodes
    for node in order:
        for neighbour in graph[node]:
            # Check if it would be cheaper to take this path, as opposed to the last path
            if weights[neighbour[0]] > weights[node] + neighbour[1]:
                # If it is, then adjust the weight and predecessor
                weights[neighbour[0]] = weights[node] + neighbour[1]
                predecessor[neighbour[0]] = node
    # Return the shortest path to the end
    path = [end]
    while path[-1] != start:
        path.append(predecessor[path[-1]])
    return path[::-1]
Edit: As asked by Beta, here is the topological sort:
# Topologically sorts the given graph, starting from the given start point.
def toplogicalSort(start, graph):
    # Runs a DFS on all nodes connected to the starting node in the graph
    def DFS(start):
        for node in graph[start]:
            if not node[0] in checked:
                checked[node[0]] = True
                DFS(node[0])
        finish.append(start)
    # Stores the finish order of all nodes in the graph, and whether they have been checked
    finish, checked = [], {}
    DFS(start)
    # Reverses the order of the sort to get a proper topology, then returns
    return finish[::-1]
Because you cannot topologically sort a graph with cycles (and therefore undirected graphs are also out of the question, as you can't tell which node should come before another).

Edit: After reading the comments, I think that's actually what @Beta meant.
When there is a cycle, a topological sort cannot guarantee the correct ordering for shortest paths.

For example, say we have a graph:

A->C, A->B, B->C, C->B, B->D

and the correct shortest path is:

A->C->B->D

But a topological sort can generate the order:

A, B, C, D

Although B is updated to the correct weight when C is visited, B won't be visited again, so it cannot propagate the correct weight to D. (The path happens to be correct, though.)
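A concrete sketch of this failure (the weights are my own choice; the single relaxation pass mirrors the question's code):

import math

# node -> list of (neighbour, weight); weights chosen so A->C->B->D is shortest
graph = {
    'A': [('C', 1), ('B', 5)],
    'B': [('C', 5), ('D', 1)],
    'C': [('B', 1)],
    'D': [],
}
order = ['A', 'B', 'C', 'D']    # an order a DFS-based "topological sort" may produce

dist = {v: math.inf for v in graph}
dist['A'] = 0
for u in order:                  # one relaxation pass, as in the question's code
    for v, w in graph[u]:
        dist[v] = min(dist[v], dist[u] + w)

print(dist['D'])  # prints 6 -- but the true shortest distance A->C->B->D is 3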
I am working with complex networks. I want to find groups of nodes which form a cycle of 3 nodes (or triangles) in a given graph. As my graph contains about a million edges, using a simple iterative solution (multiple "for" loops) is not very efficient.

I am using Python for my programming; if there are some inbuilt modules for handling these problems, please let me know.

If someone knows of any algorithm which can be used for finding triangles in graphs, kindly reply back.
Assuming it's an undirected graph, the answer lies in the networkx library of Python. If you just need to count triangles, use:

import networkx as nx
tri = nx.triangles(g)

But if you need to know the node lists with a triangle (triadic) relationship, use:

all_cliques = nx.enumerate_all_cliques(g)

This will give you all cliques (k = 1, 2, 3, ..., up to the maximum clique size). So, to filter just triangles, i.e. k = 3:

triad_cliques = [x for x in all_cliques if len(x) == 3]

triad_cliques will give a list of node triples, each forming a triangle.
A million edges is quite small. Unless you are doing it thousands of times, just use a naive implementation.
I'll assume that you have a dictionary of node_ids, which point to a sequence of their neighbors, and that the graph is directed.
For example:
nodes = {}
nodes[0] = (1, 2)
nodes[1] = tuple()  # empty tuple
nodes[2] = (1,)     # one-element tuple
My solution:
def generate_triangles(nodes):
    """Generate triangles. Weed out duplicates."""
    visited_ids = set()  # remember the nodes that we have tested already
    for node_a_id in nodes:
        for node_b_id in nodes[node_a_id]:
            if node_b_id == node_a_id:
                raise ValueError  # nodes shouldn't point to themselves
            if node_b_id in visited_ids:
                continue  # we should have already found b->a->??->b
            for node_c_id in nodes[node_b_id]:
                if node_c_id in visited_ids:
                    continue  # we should have already found c->a->b->c
                if node_a_id in nodes[node_c_id]:
                    yield (node_a_id, node_b_id, node_c_id)
        visited_ids.add(node_a_id)  # don't search a - we already have all those cycles
Checking performance:
from random import randint

n = 1000000
node_list = range(n)
nodes = {}
for node_id in node_list:
    node = tuple()
    for i in range(randint(0, 10)):  # add up to 10 neighbors
        try:
            neighbor_id = node_list[node_id + randint(-5, 5)]  # pick a nearby node
        except IndexError:
            continue
        if not neighbor_id in node:
            node = node + (neighbor_id,)
    nodes[node_id] = node

cycles = list(generate_triangles(nodes))
print(len(cycles))
When I tried it, it took longer to build the random graph than to count the cycles.
You might want to test it though ;) I won't guarantee that it's correct.
You could also look into networkx, which is the big python graph library.
A pretty easy and clear way to do it is to use networkx:

With networkx you can get the loops of an undirected graph by nx.cycle_basis(G) and then select the ones with 3 nodes:

cycls_3 = [c for c in nx.cycle_basis(G) if len(c) == 3]

Or you can find all the cliques by find_cliques(G) and then select the ones you want (those with 3 nodes). Cliques are sections of the graph where all the nodes are connected to each other, which happens in cycles/loops with 3 nodes.
Even though it isn't efficient, you may want to implement a solution, so use the loops. Write a test so you can get an idea as to how long it takes.
Then, as you try new approaches you can do two things:
1) Make certain that the answer remains the same.
2) See what the improvement is.
Having a faster algorithm that misses something is probably going to be worse than having a slower one.
Once you have the slow test, you can see if you can do this in parallel and see what the performance increase is.
Then, you can see if you can mark all nodes that have fewer than two neighbours, since those can never be part of a triangle.
Ideally, you may want to shrink it down to just 100 or so first, so you can draw it, and see what is happening graphically.
Sometimes your brain will see a pattern that isn't as obvious when looking at algorithms.
I don't want to sound harsh, but have you tried to Google it? The first link is a pretty quick algorithm to do that:
http://www.mail-archive.com/algogeeks@googlegroups.com/msg05642.html
And then there is this article on ACM (which you may have access to):
http://portal.acm.org/citation.cfm?id=244866
(and if you don't have access, I am sure if you kindly ask the lady who wrote it, you will get a copy.)
Also, I can imagine a triangle enumeration method based on clique-decomposition, but I don't know if it was described somewhere.
I am working on the same problem of counting the number of triangles in an undirected graph, and wisty's solution works really well in my case. I have modified it a bit so that only undirected triangles are counted.
#### function for counting undirected cycles
def generate_triangles(nodes):
    visited_ids = set()  # mark visited node
    for node_a_id in nodes:
        temp_visited = set()  # to get undirected triangles
        for node_b_id in nodes[node_a_id]:
            if node_b_id == node_a_id:
                raise ValueError  # to prevent self-loops; if your graph allows self-loops then you don't need this condition
            if node_b_id in visited_ids:
                continue
            for node_c_id in nodes[node_b_id]:
                if node_c_id in visited_ids:
                    continue
                if node_c_id in temp_visited:
                    continue
                if node_a_id in nodes[node_c_id]:
                    yield (node_a_id, node_b_id, node_c_id)
            temp_visited.add(node_b_id)
        visited_ids.add(node_a_id)
Of course, you need to represent the graph as a dictionary, for example:
#### Test cycles ####
nodes = {}
nodes[0] = [1, 2, 3]
nodes[1] = [0, 2]
nodes[2] = [0, 1, 3]
nodes[3] = [1]
cycles = list(generate_triangles(nodes))
print(cycles)
Using the code of Wisty, the triangles found will be
[(0, 1, 2), (0, 2, 1), (0, 3, 1), (1, 2, 3)]
which counted the triangle (0, 1, 2) and (0, 2, 1) as two different triangles. With the code I modified, these are counted as only one triangle.
I used this with a relatively small dictionary of under 100 keys and each key has on average 50 values.
Surprised to see no mention of the networkx triangles function. I know it doesn't necessarily return the groups of nodes that form a triangle, but it should be pretty relevant to many who find themselves on this page.
nx.triangles(G)  # dict: how many triangles each node is part of
sum(nx.triangles(G).values()) // 3  # total number of triangles
An alternative way to return clumps of nodes would be something like the following (adj_m was not defined in the original answer; it is presumably a scipy.sparse adjacency matrix, e.g. from nx.adjacency_matrix, whose getrow method exists on scipy sparse matrices):

import numpy as np

adj_m = nx.adjacency_matrix(G)  # sparse adjacency matrix of the graph
for u, v, d in G.edges(data=True):
    u_array = adj_m.getrow(u).nonzero()[1]  # get lists of all adjacent nodes
    v_array = adj_m.getrow(v).nonzero()[1]
    # the intersection of the two sets gives the third node(s) of the triangle(s) on edge (u, v)
    third_nodes = np.intersect1d(v_array, u_array)
If you don't care about multiple copies of the same triangle in different order then a list of 3-tuples works:
from itertools import combinations as combos
[(n,nbr,nbr2) for n in G for nbr, nbr2 in combos(G[n],2) if nbr in G[nbr2]]
The logic here is to check each pair of neighbors of every node to see if they are connected. G[n] is a fast way to iterate over or look up neighbors.
If you want to get rid of reorderings, turn each triple into a frozenset and make a set of the frozensets:
set(frozenset([n, nbr, nbr2]) for n in G for nbr, nbr2 in combos(G[n], 2) if nbr in G[nbr2])
If you don't like frozenset and want a list of sets then:
triple_iter = ((n, nbr, nbr2) for n in G for nbr, nbr2 in combos(G[n],2) if nbr in G[nbr2])
triangles = set(frozenset(tri) for tri in triple_iter)
nice_triangles = [set(tri) for tri in triangles]
Do you need to find 'all' of the 'triangles', or just 'some'/'any'?
Or perhaps you just need to test whether a particular node is part of a triangle?
The test is simple: given a node A, are there any two neighbours B and C of A that are also directly connected to each other? A sketch of that test follows after this answer's last paragraph.
If you need to find all of the triangles - specifically, all groups of 3 nodes in which each node is joined to the other two - then you need to check every possible group in a very long running 'for each' loop.
The only optimisation is ensuring that you don't check the same 'group' twice, e.g. if you have already tested that B & C aren't in a group with A, then don't check whether A & C are in a group with B.
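A sketch of the per-node test with networkx (the helper name is my own; assumes an undirected graph G):

from itertools import combinations

def in_triangle(G, a):
    # a is in a triangle iff some pair of its neighbours is directly connected
    return any(G.has_edge(b, c) for b, c in combinations(G[a], 2))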
This is a more efficient version of Ajay M's answer (I would have commented on it, but I don't have enough reputation).

Indeed the enumerate_all_cliques method of networkx will return all cliques in the graph, irrespective of their length; hence looping over all of it may take a lot of time (especially with very dense graphs).

Moreover, once it is defined for triangles, it's just a matter of parametrization to generalize the method to every clique length, so here's a function:
import networkx as nx

def get_cliques_by_length(G, length_clique):
    """ Return the list of all cliques in an undirected graph G with length
    equal to length_clique. """
    cliques = []
    for c in nx.enumerate_all_cliques(G):
        if len(c) <= length_clique:
            if len(c) == length_clique:
                cliques.append(c)
        else:
            return cliques
    # return empty list if nothing is found
    return cliques
To get triangles just use get_cliques_by_length(G, 3).
Caveat: this method works only for undirected graphs. Algorithms for cliques in directed graphs are not provided in networkx.
I just found that nx.edge_disjoint_paths works to count the triangles containing a certain edge. It is faster than nx.enumerate_all_cliques and nx.cycle_basis.

It returns the edge-disjoint paths between source and target. Edge-disjoint paths are paths that do not share any edge.

The result minus 1 is the number of triangles that contain the given edge (i.e., between its source node and target node).
edge_triangle_dict = {}
for i in g.edges:
    edge_triangle_dict[i] = len(list(nx.edge_disjoint_paths(g, i[0], i[1]))) - 1