Find a cycle of 3 (triangle) from adjacency matrix - python

I have code which gets the number of triangles in an undirected graph using the matrix multiplication method. Now I would like it to also print these triangles, preferably to print their vertices. It could be done with third-party libraries, e.g. numpy or networkx, but it has to be done with matrix multiplication; I already know how I could do it with the naive version.
To make it simpler I will use the easiest adjacency matrix:
[[0, 1, 0, 0],
 [1, 0, 1, 1],
 [0, 1, 0, 1],
 [0, 1, 1, 0]]
It has the edges:
x,y
0,1
1,2
1,3
2,3
So the triangle exists between vertices 1, 2, 3, and this is what I would like the program ALSO to print to the console.
Now the code, which just prints how many triangles are in this graph:
# num of vertexes
V = 4
# graph from adjacency matrix
graph = [[0, 1, 0, 0],
         [1, 0, 1, 1],
         [0, 1, 0, 1],
         [0, 1, 1, 0]]

# get the vertexes in a dict
vertexes = {}
for i in range(len(graph)):
    vertexes[i] = i
print(vertexes)
## >> {0: 0, 1: 1, 2: 2, 3: 3}

# matrix multiplication
def multiply(A, B, C):
    global V
    for i in range(V):
        for j in range(V):
            C[i][j] = 0
            for k in range(V):
                C[i][j] += A[i][k] * B[k][j]

# Utility function to calculate
# trace of a matrix (sum of diagonal elements)
def getTrace(graph):
    global V
    trace = 0
    for i in range(V):
        trace += graph[i][i]
    return trace

# Utility function for calculating
# number of triangles in graph
def triangleInGraph(graph):
    global V
    # To store graph^2
    aux2 = [[None] * V for _ in range(V)]
    # To store graph^3
    aux3 = [[None] * V for i in range(V)]
    # Initialising aux matrices with 0
    for i in range(V):
        for j in range(V):
            aux2[i][j] = aux3[i][j] = 0
    # aux2 is graph^2 now
    multiply(graph, graph, aux2)
    # after this multiplication aux3 is graph^3
    multiply(graph, aux2, aux3)
    trace = getTrace(aux3)
    return trace // 6

print("Total number of Triangle in Graph :",
      triangleInGraph(graph))
## >> Total number of Triangle in Graph : 1

The thing is, the information of the triangle (more generally speaking, the information of the paths between a vertex i and a vertex j) is lost during that matrix multiplication process. All that is stored is how many such paths exist.
For the adjacency matrix itself, whose numbers are the numbers of length-1 paths between i and j, the answer is obvious: if a path exists, then it has to be the edge (i,j). But even in M², when you see the number 2 at row i, column j of M², all you know is that there are 2 length-2 paths connecting i to j; that is, there exist 2 different indices k₁ and k₂ such that (i,k₁) and (k₁,j) are edges, and so are (i,k₂) and (k₂,j).
That is exactly why matrix multiplication works (and that is a virtue of coding it as explicitly as you did: I don't need to remind you that element M²ᵢⱼ = Σₖ Mᵢₖ×Mₖⱼ).
So it is exactly that: a 1 for every intermediate vertex k such that (i,k) and (k,j) are both edges, that is, for every k such that (i,k),(k,j) is a length-2 path from i to j.
But as you can see, that Σ is just a sum. In a sum, we lose the detail of what contributed to it.
In other words, there is nothing to recover from what you computed. You've just computed the number of length-3 paths from i to j, for all i and j, and, in particular what you are interested in, the number of length-3 paths from i to i for all i.
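To see that concretely on your example (a quick check, assuming numpy, which the optimisation section below uses anyway):

import numpy as np

M = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])
M3 = M @ M @ M
print(M3.diagonal())    # [0 2 2 2]: vertex 0 is in no triangle, 1, 2, 3 each in one
print(M3.trace() // 6)  # 1, but nothing here says *which* vertices form the triangle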
So the only solution you have is to write another algorithm that does a completely different computation (which makes yours useless: why compute the number of paths when you have, or will compute, the list of paths?).
That computation is a rather classic one: you are just looking for paths from one node to another. Only, those two nodes are the same.
Nevertheless, the most classical algorithms (Dijkstra, Ford, ...) are not really useful here (you are not searching for the shortest path, and you want all paths, not just one).
One method I can think of is to start nevertheless ("nevertheless" because I said earlier that your computation of the number of paths was redundant) from your code. Not that it is the easiest way, but your code is already here; besides, I always try to stay as close as possible to the original code.
Compute a matrix of paths
As I've said earlier, the formula ΣAᵢₖBₖⱼ makes sense: it is computing the number of cases where we have some paths (Aᵢₖ) from i to k and some other paths (Bₖⱼ) from k to j.
You just have to do the same thing, but instead of summing a number, sum a list of paths.
For the sake of simplicity, here, I'll use lists to store paths. So the path i,k,j is stored in the list [i,k,j]. In each cell of our matrix we have a list of paths, so a list of lists (and since our matrix is itself implemented as a list of lists, that makes the path matrix a list of lists of lists of lists).
The path matrix (I made up the name just now, but I am pretty sure it already has an official name, since the idea can't be new; and that official name is probably "path matrix") for the initial matrix is very simple: each element is [] (no path) where Mᵢⱼ is 0, and [[i,j]] (1 path, i→j) where Mᵢⱼ is 1.
So, let's build it
def adjacencyToPath(M):
    P = [[[] for _ in range(len(M))] for _ in range(len(M))]
    for i in range(len(M)):
        for j in range(len(M)):
            if M[i][j] == 1:
                P[i][j] = [[i, j]]
            else:
                P[i][j] = []
    return P
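For the sample graph, that gives for instance:

P = adjacencyToPath(graph)
print(P[0][1])  # [[0, 1]] -> one length-1 path, the edge 0-1
print(P[0][2])  # []       -> no edge between 0 and 2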
Now that you have that, we just have to follow the same idea as in the matrix multiplication. For example (to use the most complete example, even if it is out of your scope, since you don't compute more than M³), when you compute M²×M³ and say M⁵ᵢⱼ = Σₖ M²ᵢₖ×M³ₖⱼ, that means that if M²ᵢₖ is 3 and M³ₖⱼ is 2, then you have 6 paths of length 5 between i and j whose 2nd step is at node k: all 6 possible combinations of the 3 ways to go from i to k in 2 steps and the 2 ways to go from k to j in 3 steps.
So, let's do also that for path matrix.
# Args = 2 lists of paths.
# Returns 1 list of paths.
# E.g. if lp1=[[1,2,3], [1,4,3]] and lp2=[[3,2,4,2], [3,4,5,2]]
# then returns [[1,2,3,2,4,2], [1,2,3,4,5,2], [1,4,3,2,4,2], [1,4,3,4,5,2]]
def combineListPath(lp1, lp2):
    res = []
    for p1 in lp1:
        for p2 in lp2:
            res.append(p1 + p2[1:])  # p2[0] is redundant with p1[-1]
    return res
And the path matrix multiplication therefore goes like this
def pathMult(P1, P2):
    res = [[[] for _ in range(len(P1))] for _ in range(len(P1))]
    for i in range(len(P1)):
        for j in range(len(P1)):
            for k in range(len(P1)):
                res[i][j] += combineListPath(P1[i][k], P2[k][j])
    return res
So, all we have to do now is to use this pathMult function as we used the matrix multiplication. As you computed aux2, let's compute pm2:
pm=adjacencyToPath(graph)
pm2=pathMult(pm, pm)
and as you computed aux3, let's compute pm3
pm3=pathMult(pm, pm2)
And now, you have in pm3, at each cell pm3[i][j] the list of paths of length 3, from i to j. And in particular, in all pm3[i][i] you have the list of triangles.
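On the sample graph, that gives (the deduplication at the end is my addition, since each triangle appears six times on the diagonal: 3 starting points × 2 directions):

print(pm3[1][1])  # [[1, 2, 3, 1], [1, 3, 2, 1]]

# Keep each triangle once: drop the repeated end point, compare as sets
triangles = {frozenset(p[:-1]) for i in range(len(graph)) for p in pm3[i][i]}
print(triangles)  # {frozenset({1, 2, 3})}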
Now, the advantage of this method is that it mimics exactly your way of computing the number of paths: we do the exact same thing, but instead of retaining the number of paths, we retain the list of them.
Faster way
Obviously there are more efficient ways. For example, you could just search for pairs (i,j) of connected nodes such that there is a third node k connected to both i and j (with an edge (j,k) and an edge (k,i), making no assumption about whether your graph is oriented or not).
def listTriangle(M):
    res = []
    for i in range(len(M)):
        for j in range(i, len(M)):
            if M[i][j] == 0: continue
            # So, at this point, we know i->j is an edge
            for k in range(i, len(M)):
                if M[j][k] > 0 and M[k][i] > 0:
                    res.append((i, j, k))
    return res
We assume j≥i and k≥i because the triangles (i,j,k), (j,k,i) and (k,i,j) are the same, and all exist or none does.
It could be optimized further if we assume that the graph is always non-oriented (or at least symmetric), as your example suggests. In that case, we can assume i≤j≤k, for example (since the triangles (i,j,k) and (i,k,j) are also the same), turning the third loop from for k in range(i, len(M)) into for k in range(j, len(M)). And if we also exclude loops (either because there are none, as in your example, or because we don't want to count them as part of a triangle), then we can assume i<j<k, which turns the last two loops into for j in range(i+1, len(M)) and for k in range(j+1, len(M)).
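With both assumptions in place (symmetric matrix, no loops), the function would become something like this (a sketch; the name listTriangleUndirected is mine):

def listTriangleUndirected(M):
    # Assumes M is symmetric with a zero diagonal, so i<j<k is enough.
    res = []
    for i in range(len(M)):
        for j in range(i + 1, len(M)):
            if M[i][j] == 0: continue
            for k in range(j + 1, len(M)):
                if M[j][k] > 0 and M[k][i] > 0:
                    res.append((i, j, k))
    return res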
Optimisation
One last thing I didn't want to introduce until now, to stay as close as possible to your code: it is worth mentioning that Python already has some matrix manipulation routines, through numpy and the @ operator. So it is better to take advantage of them (even though I took advantage of the fact that you reinvented the wheel of matrix multiplication to explain my path multiplication).
Your code, for example, becomes
import numpy as np

graph = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 1],
                  [0, 1, 0, 1],
                  [0, 1, 1, 0]])

# Utility function for calculating
# number of triangles in graph
# That is the core of your code
def triangleInGraph(graph):
    return (graph @ graph @ graph).trace() // 6  # numpy magic
    # shorter than your version, isn't it?

print("Total number of Triangle in Graph :",
      triangleInGraph(graph))
## >> Total number of Triangle in Graph : 1
Mine is harder to optimize that way, but it can be done. We just have to define a new type, PathList, and define what multiplication and addition of path lists are.
class PathList:
    def __init__(self, pl):
        self.l = pl

    def __mul__(self, b):  # That's my previous combineListPath
        res = []
        for p1 in self.l:
            for p2 in b.l:
                res.append(p1 + p2[1:])
        return PathList(res)

    def __add__(self, b):  # Just concatenation of the 2 lists
        return PathList(self.l + b.l)

    # For fun, a compact way to print it
    def __repr__(self):
        res = ''
        for n in self.l:
            one = ''
            for o in n:
                one = one + '→' + str(o)
            res = res + ',' + one[1:]
        return '<' + res[1:] + '>'
Using PathList (which wraps the same list of lists as before, but with add and mul operators), we can now redefine our adjacencyToPath:
def adjacencyToPath(M):
    P = [[[] for _ in range(len(M))] for _ in range(len(M))]
    for i in range(len(M)):
        for j in range(len(M)):
            if M[i][j] == 1:
                P[i][j] = PathList([[i, j]])
            else:
                P[i][j] = PathList([])
    return P
And now, a bit of numpy magic
pm = np.array(adjacencyToPath(graph))
pm3 = pm @ pm @ pm
triangles = [pm3[i, i] for i in range(len(pm3))]
pm3 is the matrix of all length-3 paths from i to j. So the cells pm3[i,i] hold the triangles.
Last remark
Some python remarks on your code.
It is better to compute V from your data than to assume the coder was consistent when choosing V=4 and a 4×4 graph. So V = len(graph) is better.
You don't need global V if you don't intend to overwrite V, and it is better to avoid as many global keywords as possible. I am not repeating a dogma here; I have nothing against a global variable from time to time, if we know what we are doing. Besides, in Python there is already a sort of local scope even for global variables (they are still local to the module), so it is not as in some languages where global variables carry a high risk of collision with library symbols. But, well, no need to take the risk of overwriting V.
There is no need either for the allocate-then-fill pattern of your matrix multiplication: you allocate the result matrix first, then call multiply(source1, source2, dest). You can just return a new matrix; you have a garbage collector now. Well, sometimes it is still a good idea to spare some work for the allocation/garbage collection, especially if you intend to "recycle" some variables (like in mult(A,A,B); mult(A,B,C); mult(A,C,B), where B is "recycled").
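For example, the multiplication could be written like this (a sketch of that remark, not a drop-in replacement for the recycling case):

def multiply(A, B):
    n = len(A)
    # Build and return a fresh result matrix instead of filling a caller-allocated one.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

aux2 = multiply(graph, graph)
aux3 = multiply(graph, aux2)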

Since the triangles are defined by a sequence of vertices i, j, k such that adj[i][j], adj[j][k] and adj[k][i] are all 1, we can define the following function:
def find_triangles(adj, n=None):
    if n is None:
        n = len(adj)
    triangles = []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if adj[i][j] and adj[j][k] and adj[k][i]:
                    triangles.append([i, j, k])
    return triangles

print("The triangles are: ", find_triangles(graph, V))
## >> The triangles are:  [[1, 2, 3]]

Related

generate random directed fully-accessible adjacent probability matrix

Given V nodes and E connections as parameters, how do I generate a random directed fully-!connected! adjacency probability matrix, where all the connection weights fanning out of a node sum to 1?
The idea is, after I pick a random starting node, to do a random walk according to the probabilities, thus generating similar-random-structured sequences.
Although I prefer an adjacency matrix, a graph is OK too.
Of course the fan-out connections can be one or many.
Loops are OK, just not from a node to itself.
I can do the walk using np.random.choice(nodes, prob).
Now that Jerome mentions it, it seems I was mistaken .. I don't want fully-connected BUT a closed loop where there are no islands of sub-graphs, i.e. all nodes are accessible via the others.
Sorry, I don't know what this type of graph is called?
here is my complex solution ;(
def gen_adjmx(self):
    passx = 1
    c = 0  # connections so far
    # until enough conns are generated
    while c < self.nconns:
        # loop the rows
        for sym in range(self.nsyms):
            if c >= self.nconns: break
            if passx == 1:  # guarantees at least one connection
                self.adj[sym, randint(self.nsyms)] = randint(100)
            else:
                if randint(2) == 1:  # maybe a conn ?
                    col = randint(self.nsyms)
                    # already exists
                    if self.adj[sym, col] > 0: continue
                    self.adj[sym, col] = randint(100)
                    c += 1
        passx += 1
    self.adj /= self.adj.sum(axis=0)
You can simply create a random matrix and normalize the rows so that each sums to 1:

v = np.random.rand(n, n)
v /= v.sum(axis=1, keepdims=True)  # keepdims so the division is row-wise
You mentioned that you want a graph which doesn't have any islands. I guess what you mean is that the adjacency matrix should be irreducible, i.e. the associated graph doesn't have any disconnected components.
One way to generate a random graph with the required property is to generate a random graph and then see if it has the property; throw it out and try again if it doesn't, otherwise keep it.
Here's a sketch of a solution with that in mind.
(1) generate a matrix n_vertices by n_vertices, which contains n_edges elements which are 1, and the rest are 0. This is a random adjacency matrix.
(2) test the adjacency matrix to see if it's irreducible. If so, keep it, otherwise go back to step 1.
I'm sure you can implement that in Python. I tried a proof of concept in Maxima (https://maxima.sourceforge.io), since it's convenient in some ways. There are probably ways to go about it which directly construct an irreducible matrix.
I implemented the irreducibility test for a matrix A as whether sum(A^^k, k, 0, n) has any 0 elements, according to: https://math.stackexchange.com/a/1703650 That test becomes more and more expensive as the number of vertices grows; and as the ratio of edges to vertices decreases, it increases the probability that you'll have to repeat steps 1 and 2. Whether that's tolerable for you depends on the typical number of vertices and edges you're working with.
random_irreducible (n_vertices, n_edges) :=
  block ([A, n: 1],
         while not irreducible (A: random_adjacency (n_vertices, n_edges))
           do n: n + 1,
         [A, n]);

random_adjacency (n_vertices, n_edges) :=
  block ([list_01, list_01_permuted, get_element],
         list_01: append (makelist (1, n_edges), makelist (0, n_vertices^2 - n_edges)),
         list_01_permuted: random_permutation (list_01),
         get_element: lambda ([i, j], list_01_permuted[1 + (i - 1) + (j - 1)*n_vertices]),
         genmatrix (get_element, n_vertices, n_vertices));

irreducible (A) :=
  is (member (0, flatten (args (sum (A^^k, k, 0, length(A))))) = false);
A couple of things: one is that I left out the part about normalizing the edge weights so they sum to 1; I guess you'll have to put that part in to get a transition matrix and not just an adjacency matrix. The other is that I didn't prevent elements on the diagonal, i.e. you can stay on a vertex instead of always going to another one. If that's important, you'll have to deal with that too.
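For what it's worth, here is a rough Python sketch of the same generate-and-test idea (my translation, assuming numpy; the function names mirror the Maxima version):

import numpy as np

def random_adjacency(n_vertices, n_edges):
    # Scatter n_edges ones into an n x n matrix of zeros.
    rng = np.random.default_rng()
    flat = np.zeros(n_vertices * n_vertices, dtype=int)
    flat[rng.choice(flat.size, size=n_edges, replace=False)] = 1
    return flat.reshape(n_vertices, n_vertices)

def irreducible(A):
    # A is irreducible iff sum(A^k, k=0..n) has no zero element.
    n = len(A)
    total = np.eye(n, dtype=int)
    power = np.eye(n, dtype=int)
    for _ in range(n):
        power = power @ A
        total += power
    return not (total == 0).any()

def random_irreducible(n_vertices, n_edges):
    tries = 1
    A = random_adjacency(n_vertices, n_edges)
    while not irreducible(A):
        tries += 1
        A = random_adjacency(n_vertices, n_edges)
    return A, tries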

Creating a minimal graph representing all combinations of 3-bit binary strings

I have an algorithm that creates a graph that has all representations of 3-bit binary strings encoded in the form of the shortest graph paths, where an even number in the path means 0, while an odd number means 1:
from itertools import permutations, product
import networkx as nx
import progressbar
import itertools

def groups(sources, template):
    func = permutations
    keys = sources.keys()
    combos = [func(sources[k], template.count(k)) for k in keys]
    for t in product(*combos):
        d = {k: iter(n) for k, n in zip(keys, t)}
        yield [next(d[k]) for k in template]

g = nx.Graph()
added = []
good = []
index = []

# I create a list of 3-bit binary strings.
# I do not include one of each pair of binary strings that are mirror images.
list_1 = [list(i) for i in itertools.product(tuple(range(2)), repeat=3) if tuple(reversed(i)) >= tuple(i)]
count = list(range(len(list_1)))
h = 0
while len(added) < len(list_1):
    # In each next step I enlarge the list `good` by the next even and odd number.
    if h != 0:
        for q in range(2):
            good.append([i for i in good if i % 2 == q][-1] + 2)
    # I create a list `c` with the indices of strings from `list_1` that are not yet
    # used, whereas the `index` list stores the numbering of strings from `list_1`
    # whose representations have already been correctly added to the `added` list.
    c = [item for item in count if item not in index]
    for m in c:
        # I create representations of binary strings, where 0 is 'v0' and 1 is 'v1'.
        # For example, the '001' combination is now 'v0v0v1'.
        a = ['v{}'.format(x % 2) for x in list_1[m]]
        if h == 0:
            for w in range(2):
                if len([i for i in good if i % 2 == w]) < a.count('v{}'.format(w)):
                    for j in range(len([i for i in good if i % 2 == w]), a.count('v{}'.format(w))):
                        good.insert(j, 2*j + w)
        sources = {}
        for x in range(2):
            sources["v{0}".format(x)] = [n for n in good if n % 2 == x]
        # For each representation in the form 'v0v0v1', for example, I examine all
        # combinations where 'v0' is an even number and 'v1' is an odd number,
        # choosing values from the `good` list and checking the following conditions.
        for aaa_binary in groups(sources, a):
            # Here, the edges and nodes from the combination `aaa_binary` are added
            # to the graph while checking whether the combination meets the
            # conditions. If so, it is added to the `added` list. If not, the newly
            # added edges are removed and the next `aaa_binary` combination is taken.
            g.add_nodes_from(aaa_binary)
            t1 = (aaa_binary[0], aaa_binary[1])
            t2 = (aaa_binary[1], aaa_binary[2])
            added_now = []
            for edge in (t1, t2):
                if not g.has_edge(*edge):
                    g.add_edge(*edge)
                    added_now.append(edge)
            added.append(aaa_binary)
            index.append(m)
            for f in range(len(added)):
                if nx.shortest_path(g, aaa_binary[0], aaa_binary[2]) != aaa_binary or nx.shortest_path(g, added[f][0], added[f][2]) != added[f]:
                    for edge in added_now:
                        g.remove_edge(*edge)
                    added.remove(aaa_binary)
                    index.remove(m)
                    break
            # Interrupt the search for a good combination if one was found and the
            # result added to the `added` list, while the index from `list_1` was
            # added to the `index` list.
            if m in index:
                break
    good.sort()
    set(good)
    index.sort()
    h = h + 1
Output paths representing 3-bit binary strings from added:
[[0, 2, 4], [0, 2, 1], [2, 1, 3], [1, 3, 5], [0, 3, 6], [3, 0, 7]]
So these are representations of 3-bit binary strings:
[[0, 0, 0], [0, 0, 1], [0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
Where in the step h = 0 the first 4 sub-lists were found, and in the step h = 1 the last two sub-lists were added.
Of course, as you can see, there are no reflections of the mirrored strings, because there is no such need in an undirected graph.
Graph:
The above solution creates a minimal graph and with the unique shortest paths. This means that one combination of a binary string has only one representation on the graph in the form of the shortest path. So the choice of a given path is a single-pointing indication of a given binary sequence.
Now I would like to use multiprocessing on the for m in c loop, because the order of finding elements does not matter here.
I try to use multiprocessing in this way:
from multiprocessing import Pool

added = []

def foo(i):
    added = []
    # do something
    added.append(x[i])
    return added

if __name__ == '__main__':
    h = 0
    while len(added) < len(c):
        pool = Pool(4)
        result = pool.imap_unordered(foo, c)
        added.append(result[-1])
        pool.close()
        pool.join()
        h = h + 1
Multiprocessing takes place in the while-loop, and the `added` list is created in the foo function. In each subsequent step h of the loop, the list `added` should be extended by subsequent values, and the current list `added` should be used in the function `foo`. Is it possible to pass the current contents of the list to the function in each subsequent step of the loop? Because in the above code, the foo function creates the new contents of the added list from scratch each time. How can this be solved?
Which in consequence gives bad results:
[[0, 2, 4], [0, 2, 1], [2, 1, 3], [1, 3, 5], [0, 1, 2], [1, 0, 3]]
Because for such a graph, nodes and edges, the condition is not met that nx.shortest_path(graph, i, j) == added[k] for the end nodes i, j of every added[k] in the added list.
Where for h = 0 the elements [0, 2, 4], [0, 2, 1], [2, 1, 3], [1, 3, 5] are good, while the elements added in the step h = 1, i.e. [0, 1, 2], [1, 0, 3], are evidently found without taking the elements from the previous step into account.
How can this be solved?
I realize that this is a type of sequential algorithm, but I am also interested in partial solutions, i.e. parallel processes even on parts of the algorithm. For example, the steps of the h while-loop could run sequentially, while the for m in c loop is multiprocessed. Or other partial solutions that would improve the entire algorithm for larger combinations.
I will be grateful for showing and implementing some idea for the use of multiprocessing in my algorithm.
I don't think you can parallelise the code as it is currently. The part that you're wanting to parallelise, the for m in c loop, accesses three global lists (good, added and index) and the graph g itself. You could use a multiprocessing.Array for the lists, but that would undermine the whole point of parallelisation, as multiprocessing.Array (docs) is synchronised, so the processes would not actually be running in parallel.
So, the code needs to be refactored. My preferred way of parallelising algorithms is to use a kind of a producer / consumer pattern
1. initialisation to set up a job queue that needs to be executed (runs sequentially)
2. have a pool of workers that all pull jobs from that queue (runs in parallel)
3. after the job queue has been exhausted, aggregate results and build up the final solution (runs sequentially)
In this case 1. would be the setup code for list_1, count and probably the h == 0 case. After that you would build a queue of "job orders", this would be the c list -> pass that list to a bunch of workers -> get the results back and aggregate. The problem is that each execution of the for m in c loop has access to global state, and the global state changes after each iteration. This logically means that you can not run the code in parallel, because the first iteration changes the global state and affects what the second iteration does. That is, by definition, a sequential algorithm. You can not, at least not easily, parallelise an algorithm that iteratively builds a graph.
You could use multiprocessing.Pool.starmap and multiprocessing.Array, but that doesn't solve the problem. You still have the graph g, which is also shared between all processes. So the whole thing would need to be refactored in such a way that each iteration of the for m in c loop is independent of any other iteration of that loop, or the entire logic has to be changed so that the for m in c loop is not needed to begin with.
UPDATE
I was thinking that you could possibly turn the algorithm towards a slightly less sequential version with the following changes. I'm pretty sure the code already does something rather similar, but the code is a little too dense for me and graph algorithms aren't exactly my specialty.
Currently, for a new triple ('101' for instance), you're generating all possible connection points in the existing graph, then adding the new triple to the graph and eliminating nodes based on measuring shortest paths. This requires checking shortest paths on the graph and modifying it, which prevents parallelisation.
NOTE: what follows is a pretty rough outline for how the code could be refactored, I haven't tested this or verified mathematically that it actually works correctly
NOTE 2: in the discussion below, '101' (notice the quotes) is a binary string, as are '00' and '1', whereas 1, 0, 4 and so on (without quotes) are vertex labels in the graph.
What if, you instead were to do a kind of a substring search on the existing graph, I'll use the first triple as an example. To initialise
generate a job_queue which contains all triples
take the first one and insert that, for instance, '000' which would be (0, 2, 4) - this is trivial no need to check anything because the graph is empty when you start so the shortest path is by definition the one you insert.
At this point you also have partial paths for '011', '001', '010' and conversely ('110' and '001' because the graph is undirected). We're going to utilise the fact that the existing graph contains sub-solutions to remaining triples in job_queue. Let's say the next triple is '010', you iterate over the binary string '010' or list('010')
if a path/vertex for '0' already exists in the graph --> continue
if a path/vertices for '01' already exists in the graph --> continue
if a path/vertices for '010' exists you're done, no need to add anything (this is actually a failure case: '010' should not have been in the job queue anymore because it was already solved).
The second bullet point would fail because '01' does not exist in the graph. Insert '1' (which in this case would be node 1) into the graph and connect it to one of the three even nodes. I don't think it matters which one, but you have to record which one it was connected to; let's say you picked 0. The graph now looks something like
0 - 2 - 4
 \  *
  \ *
   \*
    1
The optimal edge to complete the path is 1 - 2 (marked with stars) to get a path 0 - 1 - 2 for '010', this is the path that maximises the number of triples encoded, if the edge 1-2 is added to the graph. If you add 1-4 you encode only the '010' triple, where as 1 - 2 encodes '010' but also '001' and '100'.
As an aside, let's pretend you connected 1 to 2 at first, instead of 0 (the first connection was picked at random); you now have the graph

0 - 2 - 4
    |
    |
    |
    1
and you can connect 1 to either 4 or to 0, but you again get a graph that encodes the maximum number of triples remaining in job_queue.
So how do you check how many triples a potential new path encodes? You can check this relatively easily, and more importantly the check can be done in parallel without modifying the graph g. For 3-bit strings the savings from parallelism aren't that big, but for 32-bit strings they would be. Here's how it works.
(sequential) generate all possible complete paths from the sub-path 0-1 -> (0-1-2), (0-1-4).
(parallel) for each potential complete path check how many other triples that path solves, i.e. for each path candidate generate all the triples that the graph solves and check if those triples are still in job_queue.
(0-1-2) solves two other triples '001' (4-2-1) or (2-0-1) and '100' (1-0-2) or (1-2-4).
(0-1-4) only solves the triple '010', i.e. itself
the edge/path that solves the most triples remaining in job_queue is the optimal solution (I don't have a proof of this).
You run 2. above in parallel copying the graph to each worker. Because you're not modifying the graph, only checking how many triples it solves, you can do this in parallel. Each worker should have a signature something like
def check_graph(g, path, copy_of_job_queue):
    # do some magic
    return (n_paths_solved, paths_solved)
path is either (0-1-2) or (0-1-4), copy_of_job_queue should be a copy of the remaining paths on the job_queue. For K workers you create K copies of the queue. Once the worker pool finishes you know which path (0-1-2) or (0-1-4) solves the most triples.
You then add that path and modify the graph, and remove the solved paths from the job queue.
RINSE - REPEAT until job queue is empty.
There are a few obvious problems with the above; for one, you're doing a lot of copying and looping over job_queue. If you're dealing with large bit spaces, say 32 bits, then job_queue is pretty long, so you might not want to keep copying it to all the workers.
For the parallel step above (2.) you might want to have job_queue actually be a dict where the key is the triple, say '010', and the value is a boolean flag saying if that triple is already encoded in the graph.
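For illustration, that parallel step could look roughly like this (a rough sketch only: path_encodes_triple is a hypothetical helper you would still have to write, and the dict variant of job_queue is ignored here):

from functools import partial
from multiprocessing import Pool

def check_graph(g, job_queue, path):
    # Count the triples still in job_queue that this candidate path would
    # encode; the graph is only read, never modified.
    solved = [t for t in job_queue if path_encodes_triple(g, path, t)]
    return len(solved), solved

# candidates would be e.g. [(0, 1, 2), (0, 1, 4)]
with Pool() as pool:
    results = pool.map(partial(check_graph, g, job_queue), candidates)
best_path = max(zip(results, candidates), key=lambda rc: rc[0][0])[1]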
Is there a faster algorithm? Looking at these two trees (I've represented the numbers in binary to make the paths easier to see): to reduce this from 14 nodes to 7 nodes, can you layer the required paths from one tree onto the other? You can add any edge you like to one of the trees as long as it doesn't connect a node with its ancestors.
         _ 000
  _ 00 _/
 /      \_ 001
0        _ 010
 \_ 01 _/
        \_ 011
         _ 100
  _ 10 _/
 /      \_ 101
1        _ 110
 \_ 11 _/
        \_ 111
Can you see, for example, that connecting 01 to 00 would be similar to replacing the head of tree 0 with 01, and thus with one edge you have added 100, 101 and 110?

Iterate over two lists, execute function and return values

I am trying to iterate over two lists of the same length, and for each pair of entries per index, execute a function. The function aims to cluster the entries
according to some requirement X on the value the function returns.
The lists in questions are:
e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]
and the function takes 4 arguments, every combination of entries from the two lists.
The function is defined as
def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp
I have so far been able to loop over the lists simultaneously as:
for index, (eta, phi) in enumerate(zip(e_list, p_list)):
    for index2, (eta2, phi2) in enumerate(zip(e_list, p_list)):
        if index == index2: continue  # to avoid same indices
        if deltaR(eta, phi, eta2, phi2) < X:
            print (index, index2), deltaR(eta, phi, eta2, phi2)
This loop executes the function on every combination, except those with identical indices, i.e. (0,0) or (1,1), etc.
The output of the code returns:
(0, 1) 0.659449892453
(1, 0) 0.659449892453
(2, 3) 0.657024790285
(2, 4) 0.642297230697
(3, 2) 0.657024790285
(3, 4) 0.109675332432
(4, 2) 0.642297230697
(4, 3) 0.109675332432
I am trying to return the number of indices that are all matched following the condition above. In other words, to rearrange the output to:
output = [No. matched entries]
i.e.
output = [2, 3]
2 coming from the fact that indices 0 and 1 are matched
3 coming from the fact that indices 2, 3, and 4 are all matched
A possible way I have thought of is to append all the indices used to a list, such that I return
output_list = [0, 1, 1, 0, 2, 3, 4, 3, 2, 4, 4, 2, 3]
Then, I use a defaultdict to count the occurrences:

from collections import defaultdict
hits = defaultdict(int)
for index in output_list:
    hits[index] += 1

From the dict I can manipulate it to return [2, 3], but is there a more pythonic way of achieving this?
This is finding connected components of a graph, which is very easy and well documented, once you revisit the problem from that view.
The data being in two lists is a distraction. I am going to consider the data to be zip(e_list, p_list). Consider this as a graph, which in this case has 5 nodes (but could have many more on a different data set). Construct the graph using these nodes, and connected them with an edge if they pass your distance test.
From there, you only need to determine the connected components of an undirected graph, which is covered on many many places. Here is a basic depth first search on this site: Find connected components in a graph
You loop through the nodes once, performing a DFS to find all connected nodes. Once you look at a node, mark it visited, so it does not get counted again. To get the answer in the format you want, simply count the number of unvisited nodes found from each unvisited starting point, and append that to a list.
------------------------ graph theory ----------------------
You have data points that you want to break down into related groups. This is a topic in both mathematics and computer science known as graph theory. see: https://en.wikipedia.org/wiki/Graph_theory
You have data points. Imagine drawing them in eta phi space as rectangular coordinates, and then draw lines between the points that are close to each other. You now have a "graph" with vertices and edges.
To determine which of these dots have lines between them is finding connected components. Obviously it's easy to see, but if you have thousands of points, and you want a computer to find the connected components quickly, you use graph theory.
Suppose I make a list of all the eta phi points with zip(e_list, p_list), and each entry in the list is a vertex. If you store the graph in "adjacency list" format, then each vertex will also have a list of the outgoing edges which connect it to another vertex.
Finding a connected component is literally as easy as looking at each vertex, putting a checkmark by it, and then following every line to the next vertex and putting a checkmark there, until you can't find anything else connected. Now find the next vertex without a checkmark, and repeat for the next connected component.
As a programmer, you know that writing your own data structures for common problems is a bad idea when you can use published and reviewed code to handle the task. Google "python graph module". One example mentioned in comments is "pip install networkx". If you build the graph in networkx, you can get the connected components as a list of lists, then take the len of each to get the format you want: [len(_) for _ in nx.connected_components(G)]
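For instance, with e_list, p_list and deltaR as in the question, and the same threshold X = 1 used in the base-Python code further down, the networkx version is just (a sketch, assuming networkx is installed):

import networkx as nx

X = 1
G = nx.Graph()
points = list(zip(e_list, p_list))
G.add_nodes_from(range(len(points)))
for i in range(len(points)):
    for j in range(i):
        if deltaR(points[i][0], points[i][1], points[j][0], points[j][1]) < X:
            G.add_edge(i, j)
print([len(c) for c in nx.connected_components(G)])  # [2, 3]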
---------------- code -------------------
But if you don't understand the math, then you might not understand a module for graphs, nor a base python implementation, but it's pretty easy if you just look at some of those links. Basically dots and lines, but pretty useful when you apply the concepts, as you can see with your problem being nothing but a very simple graph theory problem in disguise.
My graph is a basic list here, so the vertices don't actually have names. They are identified by their list index.
e_list = [-0.619489, -0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836, -1.14238, 1.73884, 1.94904, 1.84]

def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de*de + dp*dp

X = 1  # you never actually said, but this works

def these_two_particles_are_going_the_same_direction(p1, p2):
    return deltaR(p1.eta, p1.phi, p2.eta, p2.phi) < X

class Vertex(object):
    def __init__(self, eta, phi):
        self.eta = eta
        self.phi = phi
        self.connected = []
        self.visited = False

class Graph(object):
    def __init__(self, e_list, p_list):
        self.vertices = []
        for eta, phi in zip(e_list, p_list):
            self.add_node(eta, phi)

    def add_node(self, eta, phi):
        # add this data point at the next available index
        n = len(self.vertices)
        a = Vertex(eta, phi)
        for i, b in enumerate(self.vertices):
            if these_two_particles_are_going_the_same_direction(a, b):
                b.connected.append(n)
                a.connected.append(i)
        self.vertices.append(a)

    def reset_visited(self):
        for v in self.vertices:
            v.visited = False

    def DFS(self, n):
        # perform depth first search from node n, return count of connected vertices
        count = 0
        v = self.vertices[n]
        if not v.visited:
            v.visited = True
            count += 1
            for i in v.connected:
                count += self.DFS(i)
        return count

    def connected_components(self):
        self.reset_visited()
        components = []
        for i, v in enumerate(self.vertices):
            if not v.visited:
                components.append(self.DFS(i))
        return components

g = Graph(e_list, p_list)
print(g.connected_components())

How can I choose random indices based on probability?

I have a list of numbers and I'm trying to write a function that will choose n random indices i, such that i's likelihood is percentages[i].
Function:
def choose_randomly(probabilities, n):
    percentages = accumulated_s(probabilities)
    result = []
    for i in range(n):
        r = random()
        for j in range(n):
            if r < percentages[j]:
                result = result + [j]
    return result
accumulated_s will just generate a corresponding list of probabilities.
I'm expecting results like this:
choose_randomly([1, 2, 3, 4], 2) -> [3 3 0]
choose_randomly([1, 2, 3, 4], 2) -> [1 3 1]
The problem is that this is not returning n indices. Can anyone point out what I'm doing wrong?
Thank you so much!
Once you've found the right range of probabilities, you're done; break out of the inner loop to generate the next value, or you'll act as if all probabilities above the correct threshold were matched as well:
# Enumerate all percentages, not just the first n
for j, pct in enumerate(percentages):
    if r < pct:
        result.append(j)  # Don't create tons of temporary lists; mutate in place
        break  # <-- Don't add more results
Also note, if you have a lot of values in the set of probabilities, it may make sense to use functions from the bisect module to find the correct value, rather than scanning linearly each time; for a small number of entries in percentages, linear scanning is fine, but for a large number, O(log n) lookups may beat O(n) scans.
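For example (a sketch, reusing the accumulated_s helper from the question):

import bisect
from random import random

def choose_randomly(probabilities, n):
    # accumulated_s returns the sorted list of cumulative probabilities
    percentages = accumulated_s(probabilities)
    # bisect_right finds the first index whose cumulative value exceeds r, in O(log n)
    return [bisect.bisect_right(percentages, random()) for _ in range(n)]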

Random contiguous slice of list in Python based on a single random integer

Using a single random number and a list, how would you return a random slice of that list?
For example, given the list [0,1,2] there are seven possibilities of random contiguous slices:
[ ]
[ 0 ]
[ 0, 1 ]
[ 0, 1, 2 ]
[ 1 ]
[ 1, 2]
[ 2 ]
Rather than getting a random starting index and a random end index, there must be a way to generate a single random number and use that one value to figure out both starting index and end/length.
I need it that way, to ensure these 7 possibilities have equal probability.
Simply fix one order in which you would sort all possible slices, then work out a way to turn an index in that list of all slices back into the slice endpoints. For example, the order you used could be described by
The empty slice is before all other slices
Non-empty slices are ordered by their starting point
Slices with the same starting point are ordered by their endpoint
So the index 0 should return the empty list. Indices 1 through n should return [0:1] through [0:n]. Indices n+1 through n+(n-1)=2n-1 would be [1:2] through [1:n]; 2n through n+(n-1)+(n-2)=3n-3 would be [2:3] through [2:n] and so on. You see a pattern here: the last index for a given starting point is of the form n+(n-1)+(n-2)+(n-3)+…+(n-k), where k is the starting index of the sequence. That's an arithmetic series, so that sum is (k+1)(2n-k)/2=(2n+(2n-1)k-k²)/2. If you set that term equal to a given index, and solve that for k, you get some formula involving square roots. You could then use the ceiling function to turn that into an integral value for k corresponding to the last index for that starting point. And once you know k, computing the end point is rather easy.
But the quadratic equation in the solution above makes things really ugly. So you might be better off using some other order. Right now I can't think of a way which would avoid such a quadratic term. The order Douglas used in his answer doesn't avoid square roots, but at least his square root is a bit simpler due to the fact that he sorts by end point first. The order in your question and my answer is called lexicographical order, his would be called reverse lexicographical and is often easier to handle since it doesn't depend on n. But since most people think about normal (forward) lexicographical order first, this answer might be more intuitive to many and might even be the required way for some applications.
Here is a bit of Python code which lists all sequence elements in order, and does the conversion from index i to endpoints [k:m] the way I described above:
from math import ceil, sqrt

n = 3

print("{:3} []".format(0))
for i in range(1, n*(n+1)//2 + 1):
    b = 1 - 2*n
    c = 2*(i - n) - 1
    # solve k^2 + b*k + c = 0
    k = int(ceil((- b - sqrt(b*b - 4*c))/2.))
    m = k + i - k*(2*n-k+1)//2
    print("{:3} [{}:{}]".format(i, k, m))
The - 1 term in c doesn't come from the mathematical formula I presented above. It's more like subtracting 0.5 from each value of i. This ensures that even if the result of sqrt is slightly too large, you won't end up with a k which is too large. So that term accounts for numeric imprecision and should make the whole thing pretty robust.
The term k*(2*n-k+1)//2 is the last index belonging to starting point k-1, so i minus that term is the length of the subsequence under consideration.
You can simplify things further. You can perform some computation outside the loop, which might be important if you have to choose random sequences repeatedly. You can divide b by a factor of 2 and then get rid of that factor in a number of other places. The result could look like this:
from math import ceil, sqrt

n = 3
b = n - 0.5
bbc = b*b + 2*n + 1

print("{:3} []".format(0))
for i in range(1, n*(n+1)//2 + 1):
    k = int(ceil(b - sqrt(bbc - 2*i)))
    m = k + i - k*(2*n-k+1)//2
    print("{:3} [{}:{}]".format(i, k, m))
It is a little strange to give the empty list equal weight with the others. It is more natural for the empty list to be given weight 0 or n+1 times the others, if there are n elements on the list. But if you want it to have equal weight, you can do that.
There are n*(n+1)/2 nonempty contiguous sublists. You can specify these by the end point, from 0 to n-1, and the starting point, from 0 to the endpoint.
Generate a random integer x from 0 to n*(n+1)/2.
If x=0, return the empty list. Otherwise, x is uniformly distributed from 1 through n(n+1)/2.
Compute e = floor(sqrt(2*x)-1/2). This takes the values 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, etc.
Compute s = (x-1) - e*(e+1)/2. This takes the values 0, 0, 1, 0, 1, 2, 0, 1, 2, 3, ...
Return the interval starting at index s and ending at index e.
(s,e) takes the values (0,0),(0,1),(1,1),(0,2),(1,2),(2,2),...
import random
import math

n = 10
x = random.randint(0, n*(n+1)//2)
if x == 0:
    print(list(range(n))[0:0])  # empty set
    exit()
e = int(math.floor(math.sqrt(2*x) - 0.5))
s = int(x - 1 - (e*(e+1)/2))
print(list(range(n))[s:e+1])  # starting at s, ending at e, inclusive
First create all possible slice indexes.
[0:0], [1:1], etc are equivalent, so we include only one of those.
Finally you pick a random index couple, and apply it.
import random

l = [0, 1, 2]
combination_couples = [(0, 0)]
length = len(l)

# Creates all index couples.
for j in range(1, length+1):
    for i in range(j):
        combination_couples.append((i, j))
print(combination_couples)

rand_tuple = random.sample(combination_couples, 1)[0]
final_slice = l[rand_tuple[0]:rand_tuple[1]]
print(final_slice)
To ensure we got them all:
for i in combination_couples:
    print(l[i[0]:i[1]])
Alternatively, with some math...
For a length-3 list the possible index numbers run from 0 to 3, that is n=4. You choose 2 of them, that is k=2. The first index has to be smaller than the second, therefore we need to calculate the combinations as described here.
from math import factorial as f

def total_combinations(n, k=2):
    result = 1
    for i in range(1, k+1):
        result *= n - k + i
    result //= f(k)  # integer division keeps the result an int
    # We add 1 since we included [0:0] as well.
    return result + 1

print(total_combinations(n=4))  # Prints 7 as expected.
there must be a way to generate a single random number and use that one value to figure out both starting index and end/length.
It is difficult to say which method is best, but if you're only interested in binding a single random number to your contiguous slice, you can use modulo.
Given a list l and a single random number r, you can get your contiguous slice like this:

l[r % len(l) : some_sparkling_transformation(r) % len(l)]

where some_sparkling_transformation(r) is essential. It depends on your needs, but since I don't see any special requirements in your question, it could be for example:

l[r % len(l) : (2 * r) % len(l)]

The most important thing here is that both the left and right edges of the slice are correlated with r. This makes it a problem to define contiguous slices that won't follow any observable pattern. The example above (with 2 * r) produces slices that are always empty lists or follow a pattern of [a : 2 * a].
Let's use some intuition. We know that we want to find a good random representation of the number r in the form of a contiguous slice. It turns out that we need to find two numbers, a and b, that are respectively the left and right edges of the slice. Assuming that r is a good random number (we like it in some way), we can say that a = r % len(l) is a good approach.
Let's now try to find b. The best way to generate another nice random number is to use a random number generator (random or numpy), both of which support seeding. Example with the random module:
import random

def contiguous_slice(l, r):
    random.seed(r)
    a = int(random.uniform(0, len(l)+1))
    b = int(random.uniform(0, len(l)+1))
    a, b = sorted([a, b])
    return l[a:b]
Good luck and have fun!
