How to find connected components in a matrix using Julia - python

Assume I have the following matrix (defined here in Julia language):
mat = [1 1 0 0 0 ; 1 1 0 0 0 ; 0 0 0 0 1 ; 0 0 0 1 1]
Considering as a "component" a group of neighbour elements that have value '1', how to identify that this matrix has 2 components and which vertices compose each one?
For the matrix mat above I would like to find the following result:
Component 1 is composed by the following elements of the matrix (row,column):
(1,1)
(1,2)
(2,1)
(2,2)
Component 2 is composed by the following elements:
(3,5)
(4,4)
(4,5)
I can use graph algorithms like this to identify components in square matrices. However, such algorithms cannot be used for non-square matrices like the one I present here.
Any idea will be much appreciated.
I am open if your suggestion involves the use of a Python library + PyCall. Although I would prefer to use a pure Julia solution.
Regards

Using Images.jl's label_components is indeed the easiest way to solve the core problem. However, your loop over 1:maximum(labels) may not be efficient: it's O(N*n), where N is the number of elements in labels and n the maximum, because you visit each element of labels n times.
You'd be much better off visiting each element of labels just twice: once to determine the maximum, and once to assign each non-zero element to its proper group:
using Images
function collect_groups(labels)
    groups = [Int[] for i = 1:maximum(labels)]
    for (i,l) in enumerate(labels)
        if l != 0
            push!(groups[l], i)
        end
    end
    groups
end
mat = [1 1 0 0 0 ; 1 1 0 0 0 ; 0 0 0 0 1 ; 0 0 0 1 1]
labels = label_components(mat)
groups = collect_groups(labels)
Output on your test matrix (each group lists linear, column-major indices into mat):
2-element Array{Array{Int64,1},1}:
[1,2,5,6]
[16,19,20]
Calling library functions like find can occasionally be useful, but it's also a habit from slower languages that's worth leaving behind. In Julia, you can write your own loops and they will be fast; better yet, the resulting algorithm is often much easier to understand. collect(zip(ind2sub(size(mat),find( x -> x == value, mat))...)) does not exactly roll off the tongue.

The answer is pretty simple (though I can't provide Python code):
1. Collect all 1s into a list.
2. Select an arbitrary element of the list generated in step 1 and use any graph-traversal algorithm to traverse all neighbouring 1s, removing visited 1s from the list generated in step 1.
3. Repeat step 2 until the list generated in step 1 is empty.
In pseudocode (using BFS):
//generate a list with the position of all 1s in the matrix
list pos
for int x in [0 , matrix_width[
    for int y in [0 , matrix_height[
        if matrix[x][y] == 1
            add(pos , {x , y})
while NOT isempty(pos)
    //traverse the graph using BFS
    list visited
    list next
    add(next , remove(pos , 0))
    while NOT isempty(next)
        pair p = remove(next , 0)
        add(visited , p)
        remove(pos , p)
        //p is part of the specific graph that is processed in this BFS
        //each repetition of the outer while-loop processes a different graph that is part
        //of the matrix
        addall(next , distinct(visited , neighbour1s(p)))
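The BFS idea above can be sketched in Python roughly as follows. This is a minimal illustration, not the answerer's code: the function name connected_components is hypothetical, neighbours are assumed to be the four horizontally and vertically adjacent cells, and a grid of "seen" flags is swapped in for removing positions from a list. Coordinates are reported 1-based to match the question.

```python
from collections import deque

def connected_components(matrix):
    """Return a list of components; each is a sorted list of 1-based (row, col) pairs."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if matrix[r][c] != 1 or seen[r][c]:
                continue
            # Start a new BFS from this unvisited 1.
            queue = deque([(r, c)])
            seen[r][c] = True
            component = []
            while queue:
                cr, cc = queue.popleft()
                component.append((cr + 1, cc + 1))  # 1-based, as in the question
                # Assumption: 4-connectivity (up, down, left, right).
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and matrix[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True  # mark when queued so no cell is queued twice
                        queue.append((nr, nc))
            components.append(sorted(component))
    return components

mat = [[1, 1, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 1, 1]]
for k, comp in enumerate(connected_components(mat), start=1):
    print("Component", k, comp)
```

Marking a cell as seen at the moment it is queued keeps the work linear in the number of matrix elements, and nothing in the sketch requires the matrix to be square.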

Just got an answer from julia-users mailing list that solves this problem using Images.jl, a library to work with images in Julia.
They developed a function called "label_components" to identify connected components in matrices.
Then I use a customized function called "findMat" to get the indices of such matrix of components for each component.
The answer, in Julia language:
using Images
function findMat(mat,value)
    return(collect(zip(ind2sub(size(mat),find( x -> x == value, mat))...)));
end
mat = [1 1 0 0 0 ; 1 1 0 0 0 ; 0 0 0 0 1 ; 0 0 0 1 1]
labels = label_components(mat);
for c in 1:maximum(labels)
    comp = findMat(labels,c);
    println("Component $c is composed by the following elements (row,col)");
    println("$comp\n");
end

Related

Longest route in a Matrix with hurdles (0 ,1) in python

The problem is how to find the longest route in a matrix of 0s and 1s.
We don't have any source or destination; we must find the longest possible route through 1s in the matrix.
For example, in the matrix below, the length of the longest route is 8:
1 0 0 1
1 0 0 1
1 1 1 1
0 1 0 1
Or in this matrix, it's 6:
0 0 0 1
1 1 0 0
0 1 0 0
1 1 1 1
How can we do that in python?
Now, the first thing is to define a function which takes as input the matrix you want to "explore" and a starting point:
def take_a_step(matrix, pos=(0,0), available=None):
The third arg should be either a matrix with the available places (the points not touched yet), or None for the first iteration. You could also initialize it to a standard value like a matrix of Trues directly in the arguments, but I'd prefer to do it in the function body after checking for None, so that the input matrix can be different sizes without causing unnecessary problems.
So it would come to something like:
    if(matrix[pos[0]][pos[1]]==0):
        return 0 #out of path
    if(available is None):
        #list comprehension to make a new matrix with the same dimensions as the input
        available=[[True for i in range(len(matrix[0]))] for j in range(len(matrix))]
    for i in range(4):
        available[pos[0]][pos[1]]=False #remove current position from available ones
        newpos=pos #+something in each direction
        if(available[newpos[0]][newpos[1]]):
            take_a_step(matrix, newpos, available)
    #save the results for each route, use max()
    return maxres+1
Obviously the code needs checking and testing, and there might be more efficient ways of doing the same thing, but it should be enough to let you start
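One way the outline above could be completed is a depth-first search with backtracking, sketched below. The name longest_route is hypothetical, and the sketch assumes that a route moves only up, down, left or right, never revisits a cell, and has a length equal to the number of cells it covers (which reproduces the 8 and 6 from the question). Because it tries every start cell and every simple path, the running time can grow exponentially with the number of 1s, so it is only suitable for small matrices.

```python
def longest_route(matrix):
    """Length (in cells) of the longest route through 1s without reusing a cell."""
    rows, cols = len(matrix), len(matrix[0])
    visited = [[False] * cols for _ in range(rows)]

    def dfs(r, c):
        # Longest route that starts at (r, c), given the cells already visited.
        visited[r][c] = True
        best = 0
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and matrix[nr][nc] == 1 and not visited[nr][nc]):
                best = max(best, dfs(nr, nc))
        visited[r][c] = False  # backtrack so other routes may use this cell
        return best + 1

    starts = ((r, c) for r in range(rows) for c in range(cols) if matrix[r][c] == 1)
    return max((dfs(r, c) for r, c in starts), default=0)

first = [[1, 0, 0, 1],
         [1, 0, 0, 1],
         [1, 1, 1, 1],
         [0, 1, 0, 1]]
second = [[0, 0, 0, 1],
          [1, 1, 0, 0],
          [0, 1, 0, 0],
          [1, 1, 1, 1]]
print(longest_route(first), longest_route(second))
```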

python: Simultaneously replace values in an array bigger or smaller than a threshold without using a for-loop

(Problem solved) The answer is as follows:
threshold=5
arr = np.arange(10)
new_array=[(1,num)[num>threshold] for num in arr]
new_array=[(0,num)[num<threshold] for num in new_array]
print(new_array)
I found the bug in my code below. It's obvious that arr turned into [0,1,2,3,4,1,1,1,1,1] after the first assignment and then into [0,0,0,0,0,0,0,0,0,0] after the second, because the 1s written by the first assignment are themselves smaller than the boundary.
So I want to simultaneously replace values in an array that are bigger or smaller than a threshold without using a for-loop.
I know I can use a for-loop to fix it, but I want it to be concise. I can't find any alternative.
Here is the code I mentioned, although it's not the point.
import math
import numpy as np
def cont2disc(arr,threshold):
    total = arr.size
    tmp = arr
    index = math.floor(threshold*total)
    tmp.sort()
    boundary = tmp[index]
    arr[arr>=boundary] = 1
    arr[arr<boundary] = 0
    return(arr,index,boundary)
t = 0.5
c = np.arange(10)
print(c)
c,index,boundary = cont2disc(c,t)
print(c)
print(index)
print(boundary)
result:
[0 1 2 3 4 5 6 7 8 9]
[0 0 0 0 0 0 0 0 0 0]
5
5
Although it is still a for-loop based solution, if you want it to be concise, I can suggest this one-liner:
new_array=[(0,num)[num>threshold] for num in arr]
This one-liner replaces values that are not greater than threshold with zeroes and keeps the rest. For the other way around, just change num>threshold to num<threshold.
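A loop-free alternative, assuming NumPy is acceptable (the question already uses it), is np.where, which tests the condition once against the original values so that the two replacements cannot interfere with each other. The function name binarize below is hypothetical, and the sketch assumes the goal from cont2disc: 1 for values at or above the boundary, 0 otherwise.

```python
import numpy as np

def binarize(arr, boundary):
    # The comparison is evaluated on the untouched input, then each element
    # is mapped to 1 or 0 in a single vectorised step. The input is not modified.
    return np.where(arr >= boundary, 1, 0)

c = np.arange(10)
print(binarize(c, 5))  # values 0..4 map to 0, values 5..9 map to 1
```

Computing the boolean mask first (mask = arr >= boundary) and then assigning arr[mask] = 1 and arr[~mask] = 0 would avoid the bug for the same reason: the mask is fixed before any value changes.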

levenshtein matrix cell calculation

I do not understand how the values in the Levenshtein matrix are calculated according to this article. I do know how we arrive at the edit distance of 3. Could someone explain in layman's terms how we arrive at the value in each cell?
Hi I just had a look at the link of the Wikipedia article you shared:
The way the matrix is built is described in "Definition".
Now I will just translate that into what it means and what you need to do to build the matrix all by yourself:
Just to be sure that no basic information is missing: i denotes the row number and j denotes the column number.
So lets start with the first definition line of the matrix:
It says that the matrix entry is max(i, j), if min(i,j) = 0
The condition will be fulfilled only for elements of the 0-th row and the 0-th column. (Then min(0, j) is 0 and min(i, 0) is 0). So for the 0-th row and the 0-th column you enter the value of max(i,j), which corresponds to the row number for the 0-th column and the column number for the 0-th row.
So far so good:
        k  i  t  t  e  n
     0  1  2  3  4  5  6
  s  1
  i  2
  t  3
  t  4
  i  5
  n  6
  g  7
All the other values are built as the minimum of these three values:
lev(i-1, j) + 1
lev(i, j-1) + 1
lev(i-1, j-1) + 1_(a_i != b_j)
Where lev corresponds to the already existing Levenshtein matrix elements.
lev(i, j-1) is simply the matrix component to the left of the one that we want to determine. lev(i-1, j) is the component above and lev(i-1, j-1) is the element to the left and above. Here, 1_(a_i != b_j) means that 1 is added if the letters at this position are not equal, otherwise 0.
If we jump right to the matrix element (1, 1), which corresponds to the letters (s, k), we determine the 3 components:
lev(i-1, j) + 1 = 2 [1 + 1 = 2]
lev(i, j-1) + 1 = 2 [1 + 1 = 2]
lev(i-1, j-1) + 1 = 1 [0 + 1 = 1], where the + 1 applies because k is clearly not s
Now, we take the minimum of these three values and we found the next entry of the Levenshtein matrix.
Do this evaluation for each single element row OR columnwise and the result is the full Levenshtein matrix.
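The cell-by-cell procedure described above can be sketched in Python as follows. The function name levenshtein_matrix is hypothetical; rows index the letters of the first word and columns the letters of the second, matching the layout of the partially filled table.

```python
def levenshtein_matrix(a, b):
    """Build the full matrix; entry [i][j] compares the first i letters of a with the first j letters of b."""
    rows, cols = len(a) + 1, len(b) + 1
    lev = [[0] * cols for _ in range(rows)]
    # First definition line: max(i, j) when min(i, j) == 0.
    for i in range(rows):
        lev[i][0] = i
    for j in range(cols):
        lev[0][j] = j
    # Every other cell is the minimum of the three neighbouring candidates.
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # the indicator term
            lev[i][j] = min(lev[i - 1][j] + 1,         # cell above
                            lev[i][j - 1] + 1,         # cell to the left
                            lev[i - 1][j - 1] + cost)  # cell above-left
    return lev

table = levenshtein_matrix("sitting", "kitten")
for row in table:
    print(row)
print("distance:", table[-1][-1])
```

The bottom-right cell holds the edit distance, which is 3 for this pair of words, and table[1][1] is the value 1 worked out by hand for (s, k).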
Hover your mouse above each value with the dots underneath in that matrix in the wikipedia article and it describes in layman's terms what each value means.
e.g. using (x,y) notation
element (0,0) compares None to None. (0,0) = 0 because they are equal
element (0,1) compares 'k' to None. (0,1) = 1 because:
insert 'k' to transform None to 'k' so +1
element (3,2) compares 'kit' to 'si'. (3,2) = 2 because:
None == None so +0 - Lev = 0 see element (0,0)
swap 's','k' so +1 - Lev = 1 see element (1,1)
'i' == 'i' so +0 - Lev = 1 see element (2,2)
insert 't' so +1 - Lev = 2 see element (3,2)

Formulation of a recursive solution (variable for loops)

Please consider the below algorithm:
for(j1 = n upto 0)
    for(j2 = n-j1 upto 0)
        for(j3 = n-j1-j2 upto 0)
            .
            .
            for (jmax = n -j1 - j2 - j_(max-1))
            {
                count++;
                product.append(j1 * j2 ... jmax); // just an example
            }
As you can see, some relevant points about the algo snippet above:
I have listed an algorithm with a variable number of for loops.
The result that i calculate at each innermost loop is appended to a list. This list will grow to dimension of 'count'.
Is this problem a suitable candidate for recursion? If yes, i am really not sure how to break the problem up. I am trying to code this up in python, and i do not expect any code from you guys. Just some pointers or examples in the right direction. Thank you.
Here is an initial try for a sample case http://pastebin.com/PiLNTWED
Your algorithm is finding all the m-tuples (m being the max subscript of j from your pseudocode) of non-negative integers that add up to n or less. In Python, the most natural way of expressing that would be with a recursive generator:
def gen_tuples(m, n):
    if m == 0:
        yield ()
    else:
        for x in range(n, -1, -1):
            for sub_result in gen_tuples(m-1, n-x):
                yield (x,)+sub_result
Example output:
>>> for x, y, z in gen_tuples(3, 3):
...     print(x, y, z)
3 0 0
2 1 0
2 0 1
2 0 0
1 2 0
1 1 1
1 1 0
1 0 2
1 0 1
1 0 0
0 3 0
0 2 1
0 2 0
0 1 2
0 1 1
0 1 0
0 0 3
0 0 2
0 0 1
0 0 0
You could also consider using permutations, combinations or product from the itertools module.
If you want all the possible combinations of i, j, k, ... (i.e. nested for loops)
you can use:
from itertools import product
for p in product(range(n), repeat=depth):
    j1, j2, j3, ... = p # the same as nested for loops
    # do stuff here
But beware, the number of iterations in the loop grows exponentially!
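As a rough illustration of the itertools route, the sketch below filters the full Cartesian product down to the tuples whose sum is at most n. The name tuples_up_to is hypothetical. Note that plain product does not shrink the inner ranges the way the nested loops in the question do, so it enumerates (n+1)**m candidates and discards most of them.

```python
from itertools import product

def tuples_up_to(m, n):
    """All m-tuples of non-negative integers whose sum is at most n."""
    # range(n, -1, -1) counts down from n to 0, like the loops in the question.
    return [p for p in product(range(n, -1, -1), repeat=m) if sum(p) <= n]

result = tuples_up_to(3, 3)
print(len(result))
print(result[:4])
```

For small m and n this is convenient; for larger values a recursive generator that narrows each range avoids generating candidates that are thrown away.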
The toy example will translate into a kind of tail recursion, so personally I wouldn't expect a recursive version to be more insightful for code review and maintenance.
However, to get acquainted with the principle, attempt to factor out the invariant parts / common terms from the individual loops and try to identify a pattern (and ideally prove it afterwards!). You should then be able to fix a signature for the recursive procedure to be written. Flesh it out with the parts inherent to the loop bodies (and don't forget the termination condition).
Typically, if you want to transform for loops into recursive calls, you will need to replace the for statements with if statements. For nested loops, you will transform these into function calls.
For practice, start with a dumb translation of the code that works and then attempt to see where you can optimize later.
To give you an idea to try to apply to your situation, I would translate something like this:
results = []
for i in range(n):
    results.append(do_stuff(i, n))
to something like this:
results = []
def loop(n, results, i=0):
    if i >= n:
        return results
    results.append(do_stuff(i, n))
    i += 1
    return loop(n, results, i)
There are different ways to handle returning the results list, but you can adapt this to your needs.
-- As a response to the excellent listing by Blckgnht -- Consider here the case of n = 2 and max = 3
def simpletest():
    '''
    I am going to just test the algo listing with assumption
    degree n = 2
    max = dim(m_p(n-1)) = 3,
    so j1 j2 and upto j3 are required for every entry into m_p(degree2)
    Lets just print j1,j2,j3 to verify if the function
    works in other general version where the number of for loops is not known
    '''
    n = 2
    count = 0
    for j1 in range(n, -1, -1):
        for j2 in range(n -j1, -1, -1):
            j3 = (n-(j1+j2))
            count = count + 1
            print('To calculate m_p(%d)[%d], j1,j2,j3 = ' %(n,count), j1, j2, j3)
    assert(count==6) # just a checkpoint. See P.169 for a proof
    print('No. of entries =', count)
The output of this code (which is correct):
In [54]: %run _myCode/Python/invariant_hack.py
To calculate m_p(2)[1], j1,j2,j3 = 2 0 0
To calculate m_p(2)[2], j1,j2,j3 = 1 1 0
To calculate m_p(2)[3], j1,j2,j3 = 1 0 1
To calculate m_p(2)[4], j1,j2,j3 = 0 2 0
To calculate m_p(2)[5], j1,j2,j3 = 0 1 1
To calculate m_p(2)[6], j1,j2,j3 = 0 0 2
No. of entries = 6
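In this test the last index is not looped over but fixed by the others (j3 = n - j1 - j2), so the tuples sum to exactly n. A recursive generator for that case with an arbitrary number of indices might look like the sketch below. The name gen_exact_tuples is hypothetical and m is assumed to be at least 1.

```python
def gen_exact_tuples(m, n):
    """Yield all m-tuples of non-negative integers that sum to exactly n (m >= 1)."""
    if m == 1:
        # The last index has no freedom left: it takes whatever remains.
        yield (n,)
        return
    for x in range(n, -1, -1):
        for rest in gen_exact_tuples(m - 1, n - x):
            yield (x,) + rest

for count, (j1, j2, j3) in enumerate(gen_exact_tuples(3, 2), start=1):
    print(count, j1, j2, j3)
```

With m = 3 and n = 2 this yields the same six triples, in the same order, as the hand-written loops.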

How to optimize edit distance code?

How can I optimize this edit distance code, i.e. finding the number of bits changed between two values? E.g.
word1 = '010000001000011111101000001001000110001'
word2 = '010000001000011111101000001011111111111'
When I tried to run it on Hadoop, it took ages to complete.
How can I reduce the for loops and comparisons?
#!/usr/bin/python
import os, re, string, sys
from numpy import zeros
def calculateDistance(word1, word2):
    x = zeros( (len(word1)+1, len(word2)+1) )
    for i in range(0,len(word1)+1):
        x[i,0] = i
    for i in range(0,len(word2)+1):
        x[0,i] = i
    for j in range(1,len(word2)+1):
        for i in range(1,len(word1)+1):
            if word1[i-1] == word2[j-1]:
                x[i,j] = x[i-1,j-1]
            else:
                minimum = x[i-1, j] + 1
                if minimum > x[i, j-1] + 1:
                    minimum = x[i, j-1] + 1
                if minimum > x[i-1, j-1] + 1:
                    minimum = x[i-1, j-1] + 1
                x[i,j] = minimum
    return x[len(word1), len(word2)]
I looked for a bit counting algorithm online, and I found this page, which has several good algorithms. My favorite there is a one-line function which claims to work for Python 2.6 / 3.0:
return sum( b == '1' for b in bin(word1 ^ word2)[2:] )
I don't have Python, so I can't test, but if this one doesn't work, try one of the others. The key is to count the number of 1's in the bitwise XOR of your two words, because there will be a 1 for each difference.
You are calculating the Hamming distance, right?
EDIT: I'm trying to understand your algorithm, and the way you're manipulating the inputs, it looks like they are actually arrays, and not just binary numbers. So I would expect that your code should look more like:
return sum( a != b for a, b in zip(word1, word2) )
EDIT2: I've figured out what your code does, and it's not the Hamming distance at all! It's actually the Levenshtein distance, which counts how many additions, deletions, or substitutions are needed to turn one string into another (the Hamming distance only counts substitutions, and so is only suitable for equal length strings of digits). Looking at the Wikipedia page, your algorithm is more or less a straight port of the pseudocode they have there. As they point out, the time and space complexity of a comparison of strings of length m and n is O(mn), which is pretty bad. They have a few suggestions of optimizations depending on your needs, but I don't know what you use this function for, so I can't say what would be best for you. If the Hamming distance is good enough for you, the code above should suffice (time complexity O(n)), but it gives different results on some sets of strings, even if they are of equal length, like '0101010101' and '1010101010', which have Hamming distance 10 (flip all bits) and Levenshtein distance 2 (remove the first 0 and add it at the end)
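To make the difference concrete, here is a small sketch that computes both distances for the pair of strings mentioned above. The function names hamming_distance and levenshtein_distance are hypothetical, and the Levenshtein part is a plain two-row dynamic programme rather than the asker's code.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution or match
        previous = current
    return previous[-1]

s, t = '0101010101', '1010101010'
print(hamming_distance(s, t), levenshtein_distance(s, t))
```

The Hamming function does one pass over the strings, while the Levenshtein function still does O(mn) work but keeps only two rows in memory.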
Since you haven't specified what edit distance you're using yet, I'm going to go out on a limb and assume it's Levenshtein distance. In which case, you can shave off some operations here and there:
def levenshtein(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space.
        # Not really important to the algorithm anyway.
        a,b = b,a
        n,m = m,n
    current = list(range(n+1))
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]
edit: also, you make no mention of your dataset. According to its characteristics, the implementation might change to benefit from it.
Your algorithm seems to do a lot of work. It compares every bit to all bits in the opposite bit vector, meaning you get an algorithmic complexity of O(m*n). That is unnecessary if you are computing Hamming distance, so I assume you're not.
Your loop builds an x[i,j] matrix looking like this:
0 1 0 0 0 0 0 0 1 0 0 ... (word1)
0 0 1 0 0 0 0 0 0 1
1 1 0 1 1 1 1 1 1 0
0 0 1 0 1 1 1 1 1 1
0 0 1 1 0 1 1 1 1 2
0 0 1 1 1 0 1 1 1 2
0 0 1 1 1 1 0 1 1 2
1
1
...
(example word2)
This may be useful for detecting certain types of edits, but without knowing what edit distance algorithm you are trying to implement, I really can't tell you how to optimize it.
