I have an algorithm that creates a graph encoding every 3-bit binary string as a shortest graph path, where an even number in the path means 0 and an odd number means 1:
from itertools import permutations, product
import networkx as nx
import progressbar
import itertools
def groups(sources, template):
    func = permutations
    keys = sources.keys()
    combos = [func(sources[k], template.count(k)) for k in keys]
    for t in product(*combos):
        d = {k: iter(n) for k, n in zip(keys, t)}
        yield [next(d[k]) for k in template]
g = nx.Graph()
added = []
good = []
index = []

# Build the list of 3-bit binary strings, skipping one string
# from each pair of mirror images
list_1 = [list(i) for i in itertools.product(tuple(range(2)), repeat=3) if tuple(reversed(i)) >= tuple(i)]
count = list(range(len(list_1)))

h = 0
while len(added) < len(list_1):
    # In every step after the first, extend `good` by the next even and odd number
    if h != 0:
        for q in range(2):
            good.append([i for i in good if i % 2 == q][-1] + 2)
    # `c` holds the indices of strings from `list_1` that have not been used yet,
    # while `index` stores the indices of strings whose representations
    # have already been correctly added to `added`.
    c = [item for item in count if item not in index]
    for m in c:
        # Build the template of the binary string, where 0 is 'v0' and 1 is 'v1'.
        # For example, '001' becomes ['v0', 'v0', 'v1'].
        a = ['v{}'.format(x % 2) for x in list_1[m]]
        if h == 0:
            for w in range(2):
                if len([i for i in good if i % 2 == w]) < a.count('v{}'.format(w)):
                    for j in range(len([i for i in good if i % 2 == w]), a.count('v{}'.format(w))):
                        good.insert(j, 2 * j + w)
        sources = {}
        for x in range(2):
            sources["v{0}".format(x)] = [n for n in good if n % 2 == x]
        # For each template such as 'v0v0v1', examine all assignments where 'v0'
        # is an even number and 'v1' is an odd number, drawing values from the
        # `good` list and checking the conditions below.
        for aaa_binary in groups(sources, a):
            # Add the nodes and edges of `aaa_binary` to the graph and check
            # whether the combination meets the conditions. If so, it is added
            # to `added`; if not, the newly added edges are removed and the
            # next `aaa_binary` combination is tried.
            g.add_nodes_from(aaa_binary)
            t1 = (aaa_binary[0], aaa_binary[1])
            t2 = (aaa_binary[1], aaa_binary[2])
            added_now = []
            for edge in (t1, t2):
                if not g.has_edge(*edge):
                    g.add_edge(*edge)
                    added_now.append(edge)
            added.append(aaa_binary)
            index.append(m)
            for f in range(len(added)):
                if nx.shortest_path(g, aaa_binary[0], aaa_binary[2]) != aaa_binary or nx.shortest_path(g, added[f][0], added[f][2]) != added[f]:
                    for edge in added_now:
                        g.remove_edge(*edge)
                    added.remove(aaa_binary)
                    index.remove(m)
                    break
            # Stop the search once a good combination has been found: its result
            # is already in `added` and the index from `list_1` is in `index`.
            if m in index:
                break
    good = sorted(set(good))  # keep `good` sorted and duplicate-free
    index.sort()
    h = h + 1
Output paths representing 3-bit binary strings from added:
[[0, 2, 4], [0, 2, 1], [2, 1, 3], [1, 3, 5], [0, 3, 6], [3, 0, 7]]
So these are representations of 3-bit binary strings:
[[0, 0, 0], [0, 0, 1], [0, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
The first four sub-lists were found in step h = 0, and the last two were added in step h = 1.
As you can see, there are no reflections of the mirrored strings, because there is no need for them in an undirected graph.
Graph:
The above solution creates a minimal graph with unique shortest paths. This means that each binary string has exactly one representation on the graph in the form of a shortest path, so a given path unambiguously identifies a given binary sequence.
Now I would like to use multiprocessing on the for m in c loop, because the order in which the elements are found does not matter here.
I try to use multiprocessing in this way:
from multiprocessing import Pool

added = []

def foo(i):
    added = []
    # do something
    added.append(x[i])
    return added

if __name__ == '__main__':
    h = 0
    while len(added) < len(c):
        pool = Pool(4)
        result = pool.imap_unordered(foo, c)
        added.append(result[-1])
        pool.close()
        pool.join()
        h = h + 1
Multiprocessing takes place in the while loop, and the added list is created in the foo function. In each subsequent step h of the loop, the list added should be extended with new values, and the current list added should be used inside foo. Is it possible to pass the current contents of the list to the function in each subsequent step of the loop? Because in the above code, the foo function creates the contents of the added list from scratch each time. How can this be solved?
Which, in consequence, gives bad results:
[[0, 2, 4], [0, 2, 1], [2, 1, 3], [1, 3, 5], [0, 1, 2], [1, 0, 3]]
Because for such a graph's nodes and edges, the condition nx.shortest_path(graph, i, j) == added[k] is not met for the end nodes i, j of added[k], for every k in the added list.
The elements found for h = 0, i.e. [0, 2, 4], [0, 2, 1], [2, 1, 3], [1, 3, 5], are good, while the elements added in step h = 1, i.e. [0, 1, 2], [1, 0, 3], are evidently found without taking the elements from the previous step into account.
How can this be solved?
I realize that this is a type of sequential algorithm, but I am also interested in partial solutions, i.e. parallel processing even on parts of the algorithm. For example, the steps h of the while loop could run sequentially, while the for m in c loop is multiprocessed. Or other partial solutions that would improve the entire algorithm for larger combinations.
I will be grateful for showing and implementing some idea for the use of multiprocessing in my algorithm.
I don't think you can parallelise the code as it currently stands. The part that you want to parallelise, the for m in c loop, accesses three global lists (good, added and index) and the graph g itself. You could use a multiprocessing.Array for the lists, but that would undermine the whole point of parallelisation, as multiprocessing.Array (docs) is synchronised, so the processes would not actually be running in parallel.
So, the code needs to be refactored. My preferred way of parallelising algorithms is to use a kind of producer/consumer pattern (a minimal sketch follows the list below):
initialisation to set up a job queue that needs to be executed (runs sequentially)
have a pool of workers that all pull jobs from that queue (runs in parallel)
after the job queue has been exhausted, aggregate results and build up the final solution (runs sequentially)
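Schematically, and purely as an illustration, the pattern looks like this (make_jobs, solve_job and merge are hypothetical placeholders, not functions from the question):

from multiprocessing import Pool

def make_jobs():
    # 1. sequential setup: build a list of independent job orders
    return list(range(8))

def solve_job(job):
    # 2. runs in a worker process; must not touch shared global state
    return job * job

def merge(results):
    # 3. sequential aggregation into the final solution
    return sorted(results)

if __name__ == '__main__':
    with Pool(4) as pool:
        results = pool.map(solve_job, make_jobs())
    print(merge(results))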
In this case, 1. would be the setup code for list_1, count and probably the h == 0 case. After that you would build a queue of "job orders": this would be the c list -> pass that list to a bunch of workers -> get the results back and aggregate. The problem is that each execution of the for m in c loop has access to global state, and the global state changes after each iteration. This logically means that you cannot run the code in parallel, because the first iteration changes the global state and affects what the second iteration does. That is, by definition, a sequential algorithm. You cannot, at least not easily, parallelise an algorithm that iteratively builds a graph.
You could use Pool.starmap and multiprocessing.Array, but that doesn't solve the problem. You still have the graph g, which is also shared between all processes. So the whole thing would need to be refactored in such a way that each iteration of the for m in c loop is independent of any other iteration of that loop, or the entire logic has to be changed so that the for m in c loop is not needed to begin with.
UPDATE
I was thinking that you could possibly turn the algorithm towards a slightly less sequential version with the following changes. I'm pretty sure the code already does something rather similar, but the code is a little too dense for me and graph algorithms aren't exactly my specialty.
Currently, for a new triple ('101' for instance), you're generating all possible connection points in the existing graph, then adding the new triple to the graph and eliminating nodes based on measuring shortest paths. This requires checking for shortest paths on the graph and modifying, which prevents parallelisation.
NOTE: what follows is a pretty rough outline for how the code could be refactored, I haven't tested this or verified mathematically that it actually works correctly
NOTE 2: In the discussion below, '101' (note the quotes) is a binary string, as are '00' and '1', whereas 1, 0, 4 and so on (without quotes) are vertex labels in the graph.
What if you instead did a kind of substring search on the existing graph? I'll use the first triple as an example. To initialise:
generate a job_queue which contains all triples
take the first one and insert it; for instance '000', which would be (0, 2, 4) - this is trivial, no need to check anything, because the graph is empty when you start, so the shortest path is by definition the one you insert.
At this point you also have partial paths for '011', '001' and '010' (and conversely '110' and '100', because the graph is undirected). We're going to utilise the fact that the existing graph contains sub-solutions to the remaining triples in job_queue. Let's say the next triple is '010'; you iterate over the binary string '010', or list('010'):
if a path/vertex for '0' already exists in the graph --> continue
if a path/vertices for '01' already exists in the graph --> continue
if a path/vertices for '010' exists you're done, no need to add anything (this is actually a failure case: '010' should not have been in the job queue anymore because it was already solved).
The second bullet point would fail because '01' does not exist in the graph. Insert '1' which in this case would be node 1 to the graph and connect it to one of the three even nodes, I don't think it matters which one but you have to record which one it was connected to, let's say you picked 0. The graph now looks something like
0 - 2 - 4
\ *
\ *
\*
1
The optimal edge to complete the path is 1 - 2 (marked with stars), giving the path 0 - 1 - 2 for '010'. This is the path that maximises the number of triples encoded if the edge 1-2 is added to the graph: if you add 1-4 you encode only the '010' triple, whereas 1-2 encodes '010' but also '001' and '100'.
As an aside, let's pretend you connected 1 to 2 at first, instead of 0 (the first connection was picked at random); you now have the graph
0 - 2 - 4
|
|
|
1
and you can connect 1 to either 4 or to 0, but you again get a graph that encodes the maximum number of triples remaining in job_queue.
So how do you check how many triples a potential new path encodes? You can check this relatively easily, and more importantly, the check can be done in parallel without modifying the graph g. For 3-bit strings the savings from parallelism aren't that big, but for 32-bit strings they would be. Here's how it works.
(sequential) generate all possible complete paths from the sub-path 0-1 -> (0-1-2), (0-1-4).
(parallel) for each potential complete path check how many other triples that path solves, i.e. for each path candidate generate all the triples that the graph solves and check if those triples are still in job_queue.
(0-1-2) solves two other triples '001' (4-2-1) or (2-0-1) and '100' (1-0-2) or (1-2-4).
(0-1-4) only solves the triple '010', i.e. itself.
The edge/path that solves the most triples remaining in job_queue is the optimal solution (I don't have a proof of this).
You run 2. above in parallel copying the graph to each worker. Because you're not modifying the graph, only checking how many triples it solves, you can do this in parallel. Each worker should have a signature something like
def check_graph(g, path, copy_of_job_queue):
    # do some magic
    return (n_paths_solved, paths_solved)
path is either (0-1-2) or (0-1-4), copy_of_job_queue should be a copy of the remaining paths on the job_queue. For K workers you create K copies of the queue. Once the worker pool finishes you know which path (0-1-2) or (0-1-4) solves the most triples.
You then add that path and modify the graph, and remove the solved paths from the job queue.
RINSE - REPEAT until job queue is empty.
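Purely as an illustration, here is one way such a worker could look. This is a rough, untested sketch: it assumes job_queue is a set of still-unsolved triples, reconstructs a triple from a 3-node shortest path via the even/odd encoding from the question, and ignores the fact that shortest paths may not be unique:

from itertools import permutations
from multiprocessing import Pool

import networkx as nx

def check_graph(g, candidate_path, job_queue):
    """Count how many triples still in job_queue would be encoded
    if candidate_path were added to a copy of g."""
    h = g.copy()                        # each worker mutates only its own copy
    nx.add_path(h, candidate_path)      # try the candidate path, e.g. (0, 1, 2)
    solved = set()
    for u, v in permutations(h.nodes, 2):
        if not nx.has_path(h, u, v):
            continue
        path = nx.shortest_path(h, u, v)
        if len(path) == 3:              # a 3-node shortest path encodes a triple
            triple = ''.join(str(n % 2) for n in path)
            if triple in job_queue:
                solved.add(triple)
    return (len(solved), solved)

if __name__ == '__main__':
    g = nx.Graph()
    nx.add_path(g, (0, 2, 4))           # '000' is already inserted
    job_queue = {'001', '010', '011', '100', '101', '110', '111'}
    candidates = [(0, 1, 2), (0, 1, 4)]
    with Pool(2) as pool:
        scores = pool.starmap(check_graph,
                              [(g, p, job_queue) for p in candidates])
    print(dict(zip(candidates, scores)))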
There are a few obvious problems with the above; for one, you're doing a lot of copying and looping over job_queue. If you're dealing with large bit spaces, say 32 bits, then job_queue is pretty long, so you might not want to keep copying it to all the workers.
For the parallel step above (2.) you might want to have job_queue actually be a dict where the key is the triple, say '010', and the value is a boolean flag saying if that triple is already encoded in the graph.
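For example (a hypothetical layout, not from the original code):

from itertools import product

# map each triple to a flag saying whether it is already encoded
# in the graph, instead of copying a shrinking list around
job_queue = {''.join(bits): False for bits in product('01', repeat=3)}
job_queue['000'] = True   # marked as soon as '000' is inserted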
Is there a faster algorithm? Look at these two trees (I've represented the numbers in binary to make the paths easier to see). Now, to reduce this from 14 nodes to 7 nodes, can you layer the required paths from one tree onto the other? You can add any edge you like to one of the trees as long as it doesn't connect a node with its ancestors.
_ 000
_ 00 _/
/ \_ 001
0 _ 010
\_ 01 _/
\_ 011
_ 100
_ 10 _/
/ \_ 101
1 _ 110
\_ 11 _/
\_ 111
Can you see, for example, that connecting 01 to 00 would be similar to replacing the head of the 0 tree with 01, and thus with one edge you would have added 100, 101 and 110?
Related
I have code that gets the number of triangles in an undirected graph using the matrix multiplication method. Now I would like it to also print these triangles, preferably their vertices. It could be done with third-party libraries, e.g. numpy or networkx, but it has to be done with matrix multiplication; I know I could do it with the naive version.
To make it simpler, I will use this simple adjacency matrix:
[[0, 1, 0, 0],
[1, 0, 1, 1],
[0, 1, 0, 1],
[0, 1, 1, 0]]
it has edges:
x,y
0,1
1,2
1,3
2,3
So the triangle exists between vertices 1, 2 and 3, and this is what I would like the program ALSO to print to the console.
Now the code, which just prints how many triangles are in this graph:
# number of vertices
V = 4

# graph from adjacency matrix
graph = [[0, 1, 0, 0],
         [1, 0, 1, 1],
         [0, 1, 0, 1],
         [0, 1, 1, 0]]

# get the vertices in a dict
vertexes = {}
for i in range(len(graph)):
    vertexes[i] = i

print(vertexes)
## >> {0: 0, 1: 1, 2: 2, 3: 3}

# matrix multiplication
def multiply(A, B, C):
    global V
    for i in range(V):
        for j in range(V):
            C[i][j] = 0
            for k in range(V):
                C[i][j] += A[i][k] * B[k][j]

# Utility function to calculate trace
# of a matrix (sum of diagonal elements)
def getTrace(graph):
    global V
    trace = 0
    for i in range(V):
        trace += graph[i][i]
    return trace

# Utility function for calculating
# number of triangles in graph
def triangleInGraph(graph):
    global V
    # To store graph^2
    aux2 = [[None] * V for _ in range(V)]
    # To store graph^3
    aux3 = [[None] * V for i in range(V)]
    # Initialising aux matrices with 0
    for i in range(V):
        for j in range(V):
            aux2[i][j] = aux3[i][j] = 0
    # aux2 is graph^2 now
    multiply(graph, graph, aux2)
    # after this multiplication aux3 is graph^3
    multiply(graph, aux2, aux3)
    trace = getTrace(aux3)
    return trace // 6

print("Total number of Triangle in Graph :",
      triangleInGraph(graph))
## >> Total number of Triangle in Graph : 1
The thing is, the information about the triangle (more generally, the information about the paths between a vertex i and a vertex j) is lost during the matrix multiplication. All that is stored is how many such paths exist.
For the adjacency matrix itself, whose entries count the length-1 paths between i and j, the answer is obvious: if a path exists, it has to be the edge (i,j). But already in M², when you see the number 2 at row i, column j, all you know is that there are 2 length-2 paths connecting i to j; that is, there exist 2 different indices k₁ and k₂ such that (i,k₁) and (k₁,j) are edges, and so are (i,k₂) and (k₂,j).
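Concretely, for the example matrix (a small demonstration of the point, using the question's graph):

M = [[0, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 1],
     [0, 1, 1, 0]]
# M² entry [0][2] is 1 because 0-1-2 is the single length-2 path from 0
# to 2, but the product no longer records which intermediate vertex produced it
M2 = [[sum(M[i][k] * M[k][j] for k in range(4)) for j in range(4)]
      for i in range(4)]
print(M2)
## >> [[1, 0, 1, 1], [0, 3, 1, 1], [1, 1, 2, 1], [1, 1, 1, 2]]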
That is exactly why matrix multiplication works (and that is a virtue of coding it as explicitly as you did): I don't need to remind you that element M²ᵢⱼ = Σₖ Mᵢₖ×Mₖⱼ.
So it is exactly that: a 1 for every intermediate vertex k such that (i,k) and (k,j) are both edges; in other words, a 1 for every intermediate vertex k such that (i,k),(k,j) is a length-2 path from i to j.
But as you can see, that Σ is just a sum, and in a sum we lose the detail of what contributed to it.
In other words, there is nothing to be done with what you have computed. You've computed the number of length-3 paths from i to j, for all i and j, and in particular (what you are interested in) the number of length-3 paths from i to i, for all i.
So the only solution is to write another algorithm that does a completely different computation (and that makes yours useless: why compute the number of paths when you have, or will have, the list of paths?).
That computation is a rather classic one: you are just looking for paths from one node to another. Only, here, those two nodes are the same.
Nevertheless, the most classical algorithms (Dijkstra, Ford, ...) are not really useful here (you are not searching for the shortest path, and you want all paths, not just one).
One method I can think of is to start nevertheless ("nevertheless" because I said earlier that your computation of the number of paths was redundant) from your code. Not that it is the easiest way, but your code is already here; besides, I always try to stay as close as possible to the original code.
Compute a matrix of paths
As I've said earlier, the formula ΣAᵢₖBₖⱼ makes sense: it is computing the number of cases where we have some paths (Aᵢₖ) from i to k and some other paths (Bₖⱼ) from k to j.
You just have to do the same thing, but instead of summing a number, sum a list of paths.
For the sake of simplicity, I'll use lists to store paths here. So the path i,k,j is stored as the list [i,k,j]. In each cell of our matrix we then have a list of paths, i.e. a list of lists (and since our matrix is itself implemented as a list of lists, that makes the path matrix a list of lists of lists of lists).
The path matrix (I made up the name just now, but I am pretty sure it already has an official name, since the idea can't be new, and that official name is probably "path matrix") for the initial matrix is very simple: each element is either [] (no path) where Mᵢⱼ is 0, or [[i,j]] (one path, i→j) where Mᵢⱼ is 1.
So, let's build it
def adjacencyToPath(M):
    P = [[[] for _ in range(len(M))] for _ in range(len(M))]
    for i in range(len(M)):
        for j in range(len(M)):
            if M[i][j] == 1:
                P[i][j] = [[i, j]]
            else:
                P[i][j] = []
    return P
Now that we have that, we just follow the same idea as in matrix multiplication. For example (to use the most complete example, even if out of your scope, since you don't compute more than M³), when you compute M²×M³ and say M⁵ᵢⱼ = Σₖ M²ᵢₖ M³ₖⱼ, that means that if M²ᵢₖ is 3 and M³ₖⱼ is 2, then you have 6 paths of length 5 between i and j whose 3rd step is at node k: all 6 possible combinations of the 3 ways to go from i to k in 3 steps and the 2 ways to go from k to j in 2 steps.
So, let's do also that for path matrix.
# Args: 2 lists of paths.
# Returns 1 list of paths.
# E.g., if p1=[[1,2,3], [1,4,3]] and p2=[[3,2,4,2], [3,4,5,2]]
# then returns [[1,2,3,2,4,2], [1,2,3,4,5,2], [1,4,3,2,4,2], [1,4,3,4,5,2]]
def combineListPath(lp1, lp2):
    res = []
    for p1 in lp1:
        for p2 in lp2:
            res.append(p1 + p2[1:])  # p2[0] is redundant with p1[-1]
    return res
And the path matrix multiplication therefore goes like this
def pathMult(P1, P2):
    res = [[[] for _ in range(len(P1))] for _ in range(len(P1))]
    for i in range(len(P1)):
        for j in range(len(P1)):
            for k in range(len(P1)):
                res[i][j] += combineListPath(P1[i][k], P2[k][j])
    return res
So, all we have to do now is use this pathMult function as we use the matrix multiplication. As you computed aux2, let's compute pm2
pm = adjacencyToPath(graph)
pm2 = pathMult(pm, pm)
and as you computed aux3, let's compute pm3
pm3 = pathMult(pm, pm2)
And now, you have in pm3, at each cell pm3[i][j] the list of paths of length 3, from i to j. And in particular, in all pm3[i][i] you have the list of triangles.
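For instance, to list them (each triangle shows up several times on the diagonal: once per starting vertex and per direction):

# print the triangles found on the diagonal of pm3
for i in range(len(pm3)):
    for p in pm3[i][i]:
        print(p)
## e.g. [1, 2, 3, 1] and [1, 3, 2, 1] both appear for the triangle 1-2-3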
Now, the advantage of this method is that it mimics exactly your way of computing the number of paths: we do the exact same thing, but instead of retaining the number of paths, we retain the list of them.
Faster way
Obviously there are more efficient ways. For example, you could just search for pairs (i,j) of connected nodes such that there is a third node k connected to both i and j (with an edge (j,k) and an edge (k,i), making no assumption about whether your graph is oriented or not).
def listTriangle(M):
    res = []
    for i in range(len(M)):
        for j in range(i, len(M)):
            if M[i][j] == 0: continue
            # So, at this point, we know i->j is an edge
            for k in range(i, len(M)):
                if M[j][k] > 0 and M[k][i] > 0:
                    res.append((i, j, k))
    return res
We assume j≥i and k≥i, because the triangles (i,j,k), (j,k,i) and (k,i,j) are the same, and either all exist or none do.
This could be optimized if we assume that we are always working with a non-oriented (or at least symmetric) graph, as your example suggests. In that case, we can assume i≤j≤k, for example (since triangles (i,j,k) and (i,k,j) are also the same), turning the third for from for k in range(i, len(M)) into for k in range(j, len(M)). And if we also exclude loops (either because there are none, as in your example, or because we don't want to count them as part of a triangle), then we can assume i<j<k, which turns the last two loops into for j in range(i+1, len(M)) and for k in range(j+1, len(M)). A sketch with both restrictions applied follows.
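Here is that version, assuming a symmetric adjacency matrix with a zero diagonal (the function name is mine):

def listTriangleUndirected(M):
    # assumes M is symmetric with a zero diagonal, so i < j < k suffices
    res = []
    for i in range(len(M)):
        for j in range(i + 1, len(M)):
            if M[i][j] == 0: continue
            for k in range(j + 1, len(M)):
                if M[j][k] > 0 and M[k][i] > 0:
                    res.append((i, j, k))
    return res

print(listTriangleUndirected(graph))
## >> [(1, 2, 3)]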
Optimisation
One last thing, which I didn't want to introduce until now, to stay as close as possible to your code: it is worth mentioning that python already has matrix manipulation routines, through numpy and the @ operator. So it is better to take advantage of them (even though I took advantage of the fact that you reinvented the wheel of matrix multiplication to explain my path multiplication).
Your code, for example, becomes
import numpy as np

graph = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 1],
                  [0, 1, 0, 1],
                  [0, 1, 1, 0]])

# Utility function for calculating the
# number of triangles in the graph.
# That is the core of your code.
def triangleInGraph(graph):
    return (graph @ graph @ graph).trace() // 6  # numpy magic
    # shorter than your version, isn't it?

print("Total number of Triangle in Graph :",
      triangleInGraph(graph))
## >> Total number of Triangle in Graph : 1
Mine is harder to optimize that way, but it can be done. We just have to define a new type, PathList, and define what multiplication and addition of path lists are.
class PathList:
    def __init__(self, pl):
        self.l = pl

    def __mul__(self, b):  # That's my previous pathMult
        res = []
        for p1 in self.l:
            for p2 in b.l:
                res.append(p1 + p2[1:])
        return PathList(res)

    def __add__(self, b):  # Just concatenation of the 2 lists
        return PathList(self.l + b.l)

    # For fun, a compact way to print it
    def __repr__(self):
        res = ''
        for n in self.l:
            one = ''
            for o in n:
                one = one + '→' + str(o)
            res = res + ',' + one[1:]
        return '<' + res[1:] + '>'
Using PathList (which wraps the same list of lists as before, but with add and mul operators), we can now redefine adjacencyToPath:
def adjacencyToPath(M):
    P = [[[] for _ in range(len(M))] for _ in range(len(M))]
    for i in range(len(M)):
        for j in range(len(M)):
            if M[i][j] == 1:
                P[i][j] = PathList([[i, j]])
            else:
                P[i][j] = PathList([])
    return P
And now, a bit of numpy magic
pm = np.array(adjacencyToPath(graph))
pm3 = pm @ pm @ pm
triangles = [pm3[i, i] for i in range(len(pm3))]
pm3 is the matrix of all length-3 paths from i to j, so pm3[i, i] holds the triangles.
Last remark
Some python remarks on your code.
It is better to compute V from your data than to assume the coder was coherent when choosing V=4 for a 4x4 graph. So V = len(graph) is better.
You don't need global V if you don't intend to overwrite V, and it is better to avoid as many global keywords as possible. I am not repeating a dogma here; I have nothing against a global variable from time to time, if we know what we are doing. Besides, in python there is already a sort of local structure even for global variables (they are still local to the module), so it is not as in some languages where global variables carry a high risk of collision with library symbols. But, well, no need to take the risk of overwriting V.
There is also no need for the allocate-then-write style of your matrix multiplication, where you allocate the matrices first and then call multiply(source1, source2, dest). You can just return a new matrix; you have a garbage collector now. Well, sometimes it is still a good idea to spare some work for the allocation/garbage collection, especially if you intend to "recycle" some variables (as in mult(A,A,B); mult(A,B,C); mult(A,C,B), where B is recycled).
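For example, the multiplication could build and return its result instead (a sketch of the style being suggested, with a hypothetical name, not code from the question):

def multiply_new(A, B):
    # build and return the product instead of writing into a
    # pre-allocated destination matrix
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

aux3 = multiply_new(graph, multiply_new(graph, graph))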
Since a triangle is defined by a triple of vertices i, j, k that are pairwise connected by edges, we can define the following function:
def find_triangles(adj, n=None):
    if n is None:
        n = len(adj)
    triangles = []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if (adj[i][j] and adj[j][k] and adj[k][i]):
                    triangles.append([i, j, k])
    return triangles

print("The triangles are: ", find_triangles(graph, V))
## >> The triangles are: [[1, 2, 3]]
Here is a fragment of the code responsible for creating the graph and its edges, depending on whether an edge already exists, together with the condition that validates the shortest paths:
for q in range(len(aaa_binary)):
    if len(added) != i + 1:
        g.add_nodes_from(aaa_binary[q])
        t1 = (aaa_binary[q][0], aaa_binary[q][1])
        t2 = (aaa_binary[q][1], aaa_binary[q][2])
        t3 = (aaa_binary[q][2], aaa_binary[q][3])
        if g.has_edge(*t1) == False and g.has_edge(*t2) == False and g.has_edge(*t3) == False:
            g.add_edge(*t1)
            g.add_edge(*t2)
            g.add_edge(*t3)
            added.append([aaa_binary[q], 'p' + str(i)])
            for j in range(len(added)):
                if nx.shortest_path(g, added[j][0][0], added[j][0][3]) != added[j][0] or nx.shortest_path(g, aaa_binary[q][0], aaa_binary[q][3]) != aaa_binary[q]:
                    g.remove_edge(*t1)
                    g.remove_edge(*t2)
                    g.remove_edge(*t3)
                    added.remove([aaa_binary[q], 'p' + str(i)])
                    break
        if g.has_edge(*t1) == False and g.has_edge(*t2) == False and g.has_edge(*t3) == True:
            g.add_edge(*t1)
            g.add_edge(*t2)
            added.append([aaa_binary[q], 'p' + str(i)])
            for j in range(len(added)):
                if nx.shortest_path(g, added[j][0][0], added[j][0][3]) != added[j][0] or nx.shortest_path(g, aaa_binary[q][0], aaa_binary[q][3]) != aaa_binary[q]:
                    g.remove_edge(*t1)
                    g.remove_edge(*t2)
                    added.remove([aaa_binary[q], 'p' + str(i)])
                    break
        # ... and so on for the remaining True/False combinations of the g.has_edge() conditions.
added[] - list of currently valid paths in the form [[[0, 2, 4, 6], 'p0'], [[0, 2, 4, 1], 'p1'],...]
aaa_binary[] - list of path combinations to check in the form [[0, 2, 4, 6], [0, 2, 6, 4], [0, 4, 2, 6],...]
Loop operation:
The algorithm selects one sublist from the aaa_binary list, then adds its nodes to the graph and creates the edges. For each edge, the algorithm checks whether it already exists: if it does not, it adds it to the graph; if it does, it leaves it alone. Then, if the shortest-path condition is not met, only the newly added edges are removed from the graph. And so on, until the right path from the aaa_binary list is found.
As you can see, even with four-element sublists there are 8 different combinations of False and True in the if g.has_edge() condition, which is already a technical problem. However, I would like to extend this to check, for example, eight-element paths, and then there would be 128 combinations! Obviously I cannot do that the current way.
And I care that the loop only ever adds edges that do not already exist, because that makes it easier to control the creation of the optimal graph.
Hence my question, is it possible to write such a loop differently and automate it more? I will be very grateful for any comments.
How about:
added_now = []
for edge in (t1, t2, t3):
    if not g.has_edge(*edge):
        g.add_edge(*edge)
        added_now.append(edge)
added.append([aaa_binary[q], 'p' + str(i)])
for j in range(len(added)):
    if nx.shortest_path(g, added[j][0][0], added[j][0][3]) != added[j][0] or nx.shortest_path(g, aaa_binary[q][0], aaa_binary[q][3]) != aaa_binary[q]:
        for edge in added_now:
            g.remove_edge(*edge)
        added.remove([aaa_binary[q], 'p' + str(i)])
        break
You just want to do the same when removing: only remove the edges that were actually added.
Does this solution suit you?
It doesn't block you from your 4-element paths; it adjusts to the length of the current aaa_binary[q]. If you want to use n-element paths, it should be easily modifiable. :)
And it doesn't have a never-ending list of ifs.
for q in range(len(aaa_binary)):
    if len(added) != i + 1:
        g.add_nodes_from(aaa_binary[q])
        # Instead of having hard-coded variables, make a list.
        tn = []
        for idx in range(0, len(aaa_binary[q]) - 1):
            # Linking the current elem to the next one.
            # The len() - 1 avoids iterating on the last elem,
            # which has no elem after it.
            tn.append((aaa_binary[q][idx], aaa_binary[q][idx + 1]))
        # Instead of checking each and every case, make the task
        # 'general'. Here, you want to add the edges that don't exist yet.
        indexSaver = []
        for index, item in enumerate(tn):
            if not g.has_edge(*item):
                g.add_edge(*item)
                # Remember which items we added, since we do not
                # want to execute `.has_edge` multiple times.
                indexSaver.append(index)
        # This line is quite unclear as we do not know what 'added' is,
        # nor 'i', in your code. So I will leave it as is.
        added.append([aaa_binary[q], 'p' + str(i)])
        # Now that non-existent edges have been added...
        # The hard-coded [3] was seemingly there to address the last
        # index of the path, so compute it from the path length instead.
        lastIndex = len(aaa_binary[q]) - 1
        for j in range(len(added)):
            if nx.shortest_path(g, added[j][0][0], added[j][0][lastIndex]) != added[j][0] or nx.shortest_path(g, aaa_binary[q][0], aaa_binary[q][lastIndex]) != aaa_binary[q]:
                # On the same logic as adding edges, we delete them.
                for idx, item in enumerate(tn):
                    if idx in indexSaver:
                        g.remove_edge(*item)
                added.remove([aaa_binary[q], 'p' + str(i)])
                break
I am trying to make a vector out of two different ones, as shown in the piece of code below.
However, I get a list index out of range exception on the 5th line the first time the code goes into the for loop.
What am I doing wrong?
def get_two_dimensional_vector(speeds, directions):
    vector = []
    for i in range(10):
        if (i % 2 == 0):
            vector[i/2][0] = speeds[i/2]
        else:
            vector[i/2 - 1/2][1] = directions[i/2 - 1/2]
You can't use a Python list this way. It's not like a C array with a predefined length. If you want to add a new element, you have to use the append method or something.
Aside from that, you're also using a second index, implying that the elements of vector are themselves lists or dicts or something, before they've even been assigned.
It looks like you want to convert speeds and directions to a two-dimensional list. So, first, here's how to do that with a loop. Note that I've removed the fixed-size assumption you were using, though the code still assumes that speeds and directions are the same size.
def get_two_dimensional_vector(speeds, directions):
    vector = []
    for i in range(len(speeds)):
        vector.append([speeds[i], directions[i]])
    return vector

speeds = [1, 2, 3]
directions = [4, 5, 6]
v = get_two_dimensional_vector(speeds, directions)
print(v)
Now, the Pythonic way to do it (in Python 3, zip returns a lazy iterator, so wrap it in list() to see the pairs):
print(list(zip(speeds, directions)))
How do you detect repeating digits in an infinite sequence? I tried Floyd's and Brent's cycle detection algorithms but came up with nothing...
I have a generator that yields numbers ranging from 0 to 9 (inclusive) and I have to recognize a period in it.
Example test case:
import itertools
# of course this is a fake one just to offer an example
def source():
    return itertools.cycle((1, 0, 1, 4, 8, 2, 1, 3, 3, 1))
>>> gen = source()
>>> period(gen)
(1, 0, 1, 4, 8, 2, 1, 3, 3, 1)
Empirical methods
Here's a fun take on the problem. The more general form of your question is this:
Given a repeating sequence of unknown length, determine the period of the signal.
The process to determine the repeating frequencies is known as the Fourier Transform. In your example case the signal is clean and discrete, but the following solution will work even with continuous noisy data! The FFT will try to duplicate the frequencies of the input signal by approximating them in the so-called "wave-space" or "Fourier-space". Basically a peak in this space corresponds to a repeating signal. The period of your signal is related to the longest wavelength that is peaked.
import itertools

# of course this is a fake one just to offer an example
def source():
    return itertools.cycle((1, 0, 1, 4, 8, 2, 1, 3, 3, 2))

import pylab as plt
import numpy as np
import scipy as sp

# Generate some test data, i.e. our "observations" of the signal
N = 300
vals = source()
X = np.array([next(vals) for _ in range(N)])

# Compute the FFT
W = np.fft.fft(X)
freq = np.fft.fftfreq(N, 1)

# Look for the longest signal that is "loud"
threshold = 10**2
idx = np.where(abs(W) > threshold)[0][-1]
max_f = abs(freq[idx])
print("Period estimate:", 1 / max_f)
This gives the correct answer for this case, 10, though if N didn't divide the number of cycles cleanly you would get a close estimate. We can visualize this via:
plt.subplot(211)
plt.scatter([max_f], [np.abs(W[idx])], s=100, color='r')
plt.plot(freq[:N//2], abs(W[:N//2]))
plt.xlabel(r"$f$")

plt.subplot(212)
plt.plot(1.0 / freq[:N//2], abs(W[:N//2]))
plt.scatter([1 / max_f], [np.abs(W[idx])], s=100, color='r')
plt.xlabel(r"$1/f$")
plt.xlim(0, 20)
plt.show()
Evgeny Kluev's answer provides a way to get an answer that might be right.
Definition
Let's assume you have a sequence D that is a repeating sequence; that is, there is some sequence d of length L such that D_i = d_{i mod L}, where s_i denotes the ith element of a sequence s, numbered from 0. We say that the sequence d generates D.
Theorem
Given a sequence D which you know is generated by some finite sequence, and given some candidate d, it is impossible to decide in finite time whether d generates D.
Proof
Since we are only allowed finite time, we can only access a finite number of elements of D. Suppose we access the first F elements of D. We choose the first F because, if we are only allowed to access a finite number, the set containing the indices of the accessed elements is finite and hence has a maximum. Let that maximum be M; we can then let F = M + 1, which is still finite.
Let L be the length of d and suppose that D_i = d_{i mod L} for i < F. There are two possibilities for D_F: it is either equal to d_{F mod L} or it is not. In the former case d seems to work, but in the latter case it does not. We cannot know which case holds until we access D_F. This, however, requires accessing F + 1 elements, yet we are limited to F element accesses.
Hence, for any F we won't have enough information to decide whether d generates D and therefore it is impossible to know in finite time whether d generates D.
Conclusions
It is possible to know in finite time that a sequence d does not generate D, but this doesn't help you: you want to find a sequence d that does generate D, and that involves, among other things, being able to prove that some sequence generates D.
Unless you have more information about D, this problem is unsolvable. One piece of information that would make it decidable is an upper bound on the length of the shortest d that generates D. If you know that the function generating D has only a known, finite amount of state, you can calculate this upper bound.
Hence, my conclusion is that you cannot solve this problem unless you change the specification a bit.
I have no idea about proper algorithms to apply here, but my understanding is also that you can never know for sure that you've detected a period if you've consumed only a finite number of terms. Anyway, here's what I've come up with; this is a very naive implementation, more to educate from the comments than to provide a good solution (I guess).
def guess_period(source, minlen=1, maxlen=100, trials=100):
    for n in range(minlen, maxlen + 1):
        p = [j for i, j in zip(range(n), source)]
        if all([j for i, j in zip(range(n), source)] == p
               for k in range(trials)):
            return tuple(p)
    return None
This one, however, "forgets" the initial order and returns a tuple that is a cyclic permutation of the actual period:
In [101]: guess_period(gen)
Out[101]: (0, 1, 4, 8, 2, 1, 3, 3, 1, 1)
To compensate for this, you'll need to keep track of the offset.
Since your sequence is not of the form Xn+1 = f(Xn), Floyd's or Brent's algorithms are not directly applicable to your case. But they may be extended to do the task:
Use Floyd's or Brent's algorithm to find some repeating element of the sequence.
Find next sequence element with the same value. Distance between these elements is a supposed period (k).
Remember next k elements of the sequence
Find the next occurrence of this k-element subsequence.
If the distance between the subsequences is greater than k, update k and continue with step 3.
Repeat step 4 several times to verify the result. If the maximum length of the repeating subsequence is known a priori, use the appropriate number of repetitions. Otherwise use as many repetitions as possible, because each repetition increases the result's correctness.
If the sequence cycling starts from the first element, ignore step 1 and start from step 2 (find the next sequence element equal to the first element).
If the sequence cycling does not start from the first element, it is possible that Floyd's or Brent's algorithm finds some repeating element of the sequence that does not belong to the cycle. So it is reasonable to limit the number of iterations in steps 2 and 4, and if this limit is exceeded, continue from step 1.
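A rough, untested sketch of steps 2-6, assuming the cycle starts at the first element (so step 1 is skipped). It works on a finite buffer of nbuf terms, so like everything above it can only guess a period, never prove one; the function name is mine:

from itertools import islice

def brent_like_period(gen, nbuf=1000):
    buf = list(islice(gen, nbuf))
    # step 2: distance to the next occurrence of the first element
    try:
        k = buf.index(buf[0], 1)
    except ValueError:
        return None
    # steps 3-5: find the next occurrence of the k-element prefix;
    # if it lies further away than k, grow the supposed period
    while 2 * k <= nbuf:
        pattern = buf[:k]
        for start in range(k, nbuf - k + 1):
            if buf[start:start + k] == pattern:
                break
        else:
            return None
        if start == k:
            break                  # the prefix repeats immediately
        k = start                  # step 5: update k and retry
    # step 6 (verification): check the guess over the whole buffer
    if all(buf[i] == buf[i % k] for i in range(len(buf))):
        return tuple(buf[:k])
    return None

Unlike the naive guess above, this keeps the initial order, because the cycle is assumed to start at the first element:

>>> gen = source()
>>> brent_like_period(gen)
(1, 0, 1, 4, 8, 2, 1, 3, 3, 1)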
I'm trying to write a python script to find patterns in a list.
Eg. Given this list
[1,2,3,4,5,6,4,5,6,4,5,6,4,5,6]
The script would determine that 4,5,6 occurred 3 times and then print out
3 (4,5,6)
I was hoping someone had any algorithmic insights (I can only think of n^2 algorithms where I check for patterns of size 1, then 2, then 3, and so on, iterating through the string each time) or knew of any Python built-in libraries that might help do the same thing. Thanks!
Here is a function providing a solution to the pattern matching problem:
import itertools

def pattern_match(pattern, sequence):
    """Count the number of times that pattern occurs in the sequence."""
    pattern = tuple(pattern)
    k = len(pattern)
    # create k iterators for the sequence
    i = itertools.tee(sequence, k)
    # advance the iterators
    for j in range(k):
        for _ in range(j):
            next(i[j])
    count = 0
    for q in zip(*i):
        if pattern == q:
            count += 1
    return count
To solve the stated problem, call with:
p = [4, 5, 6]
l = [1, 2, 3, 4, 5, 6, 4, 5, 6, 4, 5, 6, 4, 5, 6]
count = pattern_match(p, l)
Here is a Gist with the complete code solving the example problem.
(I believe the correct answer is that the pattern repeats 4 times, not 3 times, as stated in the question.)
I'm not sure if the complexity of this algorithm is actually less than O(n^2).
Off the top of my head, I would do this:
start with two copies of the list A and B.
pop the first value off of B
subtract B from A: C = A-B
search for areas in C that are 0; these indicate repeated strings
add repeated strings into a dict which tracks each string and the number of times it has been seen
repeat steps 2-5 until B is empty.
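A literal, hedged sketch of this shift-and-compare idea (elementwise equality standing in for the subtraction): a run of r consecutive matches at shift s means a block of s elements repeats r // s + 1 times in a row. The function name and output format are my own illustration:

def find_repeats(a, min_repeats=2):
    found = {}
    n = len(a)
    for s in range(1, n // 2 + 1):
        run = best_run = best_start = 0
        for i in range(n - s):
            # compare the list with its copy shifted by s
            run = run + 1 if a[i] == a[i + s] else 0
            if run > best_run:
                best_run, best_start = run, i - run + 1
        if best_run >= s:
            repeats = best_run // s + 1
            if repeats >= min_repeats:
                found[tuple(a[best_start:best_start + s])] = repeats
    return found

print(find_repeats([1, 2, 3, 4, 5, 6, 4, 5, 6, 4, 5, 6, 4, 5, 6]))
## >> {(4, 5, 6): 4, (4, 5, 6, 4, 5, 6): 2}

This stays O(n^2), like the loop described in the steps above.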
The algorithm you are looking for is Run-Length Encoding (RLE). The basic principles of that algorithm show how to approach detecting patterns in a sequence and counting them.
Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run.
Here is a relevant article on how to write an RLE program in Python.
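For completeness, a minimal run-length encoder using itertools.groupby; note that plain RLE compresses runs of a single repeated value, so by itself it would not find a multi-element pattern like (4, 5, 6):

from itertools import groupby

def rle(seq):
    # store each run as (value, run_length) instead of the raw run
    return [(value, len(list(group))) for value, group in groupby(seq)]

print(rle([1, 1, 1, 2, 2, 3, 1]))
## >> [(1, 3), (2, 2), (3, 1), (1, 1)]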