I am trying to count the number of islands (a group of connected 1s forms an island) in a 2D binary matrix.
Example:
[
[1, 1, 0, 0, 0],
[0, 1, 0, 0, 1],
[1, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[1, 0, 1, 0, 1]
]
In the above matrix there are 5 islands, which are:
First: (0,0), (0,1), (1,1), (2,0)
Second: (1,4), (2,3), (2,4)
Third: (4,0)
Fourth: (4,2)
Fifth: (4,4)
To count the islands in the 2D matrix, I treat the matrix as a graph and use a DFS-style algorithm.
I keep track of the number of top-level DFS calls (the DFS is a recursive function), because that is the number of connected components in the graph.
Below is the code I wrote for this purpose:
# count the 1's in the island
def count_houses(mat, visited, i, j):
    # base case
    if i < 0 or i >= len(mat) or j < 0 or j >= len(mat[0]) or\
            visited[i][j] is True or mat[i][j] == 0:
        return 0
    # marking visited at i, j
    visited[i][j] = True
    # cnt is initialized to 1 coz 1 is found
    cnt = 1
    # now go in all possible directions (i.e. form 8 branches)
    # starting from the left upper corner of i,j and going down to right bottom
    # corner of i,j
    for r in xrange(i-1, i+2, 1):
        for c in xrange(j-1, j+2, 1):
            # print 'r:', r
            # print 'c:', c
            # don't call for i, j
            if r != i and c != j:
                cnt += count_houses(mat, visited, r, c)
    return cnt
def island_count(mat):
    houses = list()
    clusters = 0
    row = len(mat)
    col = len(mat[0])
    # initialize the visited matrix
    visited = [[False for i in xrange(col)] for j in xrange(row)]
    # run over matrix, search for 1 and then do dfs when found 1
    for i in xrange(row):
        for j in xrange(col):
            # see if value at i, j is 1 in mat and val at i, j is False in
            # visited
            if mat[i][j] == 1 and visited[i][j] is False:
                clusters += 1
                h = count_houses(mat, visited, i, j)
                houses.append(h)
    print 'clusters:', clusters
    return houses
if __name__ == '__main__':
    mat = [
        [1, 1, 0, 0, 0],
        [0, 1, 0, 0, 1],
        [1, 0, 0, 1, 1],
        [0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1]
    ]
    houses = island_count(mat)
    print houses
    # print 'maximum houses:', max(houses)
I get the wrong output for the matrix above: the code reports 7 clusters, but there are 5.
I tried debugging the code for logical errors, but I couldn't find the problem.
Big-hammer approach, for reference: scipy.ndimage.label.
I had to pass the structure argument np.ones((3,3)) to get diagonal connectivity.
import numpy as np
from scipy import ndimage
ary = np.array([
[1, 1, 0, 0, 0],
[0, 1, 0, 0, 1],
[1, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[1, 0, 1, 0, 1]
])
labeled_array, num_features = ndimage.label(ary, np.ones((3,3)))
labeled_array, num_features
Out[183]:
(array([[1, 1, 0, 0, 0],
[0, 1, 0, 0, 2],
[1, 0, 0, 2, 2],
[0, 0, 0, 0, 0],
[3, 0, 4, 0, 5]]), 5)
Your algorithm is almost correct, except for the condition in the inner loop of count_houses:
    if r != i and c != j:
        cnt += count_houses(mat, visited, r, c)
With and, the recursive call only happens when both coordinates differ from the center, so only the four diagonal neighbours are visited. Instead you want to use or, because you want to keep counting as long as at least one of the coordinates differs from the center:
    if r != i or c != j:
        cnt += count_houses(mat, visited, r, c)
An alternative and more intuitive way to write this is the following:
    if (r, c) != (i, j):
        cnt += count_houses(mat, visited, r, c)
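For reference, here is a minimal Python 3 sketch of the whole counter with the corrected neighbour test. It is not the asker's code: the name `island_sizes` is made up for illustration, and it swaps the recursion for an explicit stack, which is an assumption made here to sidestep the recursion limit on large matrices.

```python
def island_sizes(mat):
    """Return the size of each 8-connected island of 1s, in row-major discovery order."""
    rows, cols = len(mat), len(mat[0])
    visited = [[False] * cols for _ in range(rows)]
    sizes = []
    for i in range(rows):
        for j in range(cols):
            if mat[i][j] != 1 or visited[i][j]:
                continue
            # Iterative DFS: an explicit stack replaces the recursive calls.
            visited[i][j] = True
            stack, size = [(i, j)], 0
            while stack:
                r, c = stack.pop()
                size += 1
                for rr in range(r - 1, r + 2):
                    for cc in range(c - 1, c + 2):
                        if (rr, cc) == (r, c):
                            continue  # skip the center cell only
                        if 0 <= rr < rows and 0 <= cc < cols \
                                and mat[rr][cc] == 1 and not visited[rr][cc]:
                            visited[rr][cc] = True
                            stack.append((rr, cc))
            sizes.append(size)
    return sizes


mat = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1],
]
sizes = island_sizes(mat)
print(len(sizes), sizes)  # 5 [4, 3, 1, 1, 1]
```

The number of islands is simply the length of the returned list.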
I have an algorithm that scans a cell's Moore neighborhood, but it is not working. Can anyone find the bug here? This problem is driving me crazy.
def moore_nh(cells, r, c):
    n = 0
    dr = [+1, -1, 0, 0, -1, 1, -1, 1]
    dc = [0, 0, +1, -1, -1, 1, 1, -1]
    for i in range(len(dr)):
        if (0 <= r+dr[i] < len(cells)) and (0 <= c+dc[i] < len(cells[r])):
            if cells[r+dr[i]][c+dc[i]] == 1:
                n+=1
    return n
def next_gen(cells):
    clone = [[0]*len(cells[0])]*len(cells)
    for r in range(len(cells)):
        for c in range(len(cells[r])):
            n = moore_nh(cells, r, c)
            if cells[r][c] == 1:
                if n < 2 or n > 3:
                    clone[r][c] = 0
                else:
                    clone[r][c] = 1
            elif ((cells[r][c] == 0) and (n == 3)):
                clone[r][c] = 1
    return clone
print(next_gen([
    [0,1,0],
    [0,1,0],
    [0,1,0]
]))
I expected this algorithm to print
[[0, 0, 0],
[1, 1, 1],
[0, 0, 0]]
but it prints
[[1, 0, 1],
[1, 0, 1],
[1, 0, 1]]
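A likely cause, offered as a sketch rather than a confirmed diagnosis: `[[0]*len(cells[0])]*len(cells)` repeats one inner list, so every row of clone is the same object and a write to one row shows up in all of them. Building one fresh list per row avoids that. The names `count_neighbours` and `next_gen_fixed` below are hypothetical, and the neighbour count assumes the cells only hold 0 or 1.

```python
def count_neighbours(cells, r, c):
    """Count live cells in the Moore neighbourhood of (r, c)."""
    n = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < len(cells) and 0 <= cc < len(cells[rr]):
                n += cells[rr][cc]  # assumes cells contain only 0 or 1
    return n


def next_gen_fixed(cells):
    # One fresh list per row, so the rows do not share storage.
    clone = [[0] * len(row) for row in cells]
    for r, row in enumerate(cells):
        for c, alive in enumerate(row):
            n = count_neighbours(cells, r, c)
            if alive == 1 and n in (2, 3):
                clone[r][c] = 1
            elif alive == 0 and n == 3:
                clone[r][c] = 1
    return clone


blinker = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(next_gen_fixed(blinker))  # [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
```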
I have a labyrinth matrix for a maze problem.
Labyrinth =
[[0, 0, 0, 0, 0, 0, 1, 0],
[0, 1, 0, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 1, 1, 3, 0],
[0, 0, 1, 1, 1, 0, 0, 0],
[0, 1, 2, 0, 1, 1, 1, 0],
[0, 1, 0, 0, 0, 0, 0, 0]]
Here,
0 represents a blocked cell that is a wall
1 represents an empty cell
2 and 3 represent the starting and ending points respectively.
I need a function that returns the path from point 2 to point 3 using the A* search algorithm, with Manhattan distance as the distance estimate and the length of the current path as the path cost.
Any pointers, tips, or clues on how I should approach this one?
Update: I want to return the path from start to end by marking it with some other character, like X. For reference:
Labyrinth =
[[0, 0, 0, 0, 0, 0, 1, 0],
[0, 1, 0, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, X, X, 3, 0],
[0, 0, X, X, X, 0, 0, 0],
[0, 1, 2, 0, 1, 1, 1, 0],
[0, 1, 0, 0, 0, 0, 0, 0]]
Classical search algorithms work with a set of states called the fringe and a set of visited states:
the fringe is the set of states that are yet to be explored, in the hope of finding the goal state
the visited set is the set of states that have already been visited, kept to avoid visiting them again
The idea of A* is to explore the state in the fringe that has the minimal total cost, defined as the sum of the heuristic cost and the progression cost (the cost of all the states you had to pass through to get there). You can find a generic implementation of this algorithm on the Wikipedia page for the A* search algorithm. In your case a state may consist of:
the i, j position in the grid
the previous state (assuming None for the first state)
the total cost of this state (heuristic + path cost).
To explore a state you only need to check the direct neighbors of its cell (including only those whose value is non-zero, i.e. not walls). It is worth noting that in the visited set you should only include the position (i, j) and the cost (as you may re-enter this state if you find a shorter path, even if that is unlikely in your problem).
Here is an example that works for your case (but may be generalized easily):
def astar(lab):
    # first, let's look for the beginning position, there is better but it works
    (i_s, j_s) = [[(i, j) for j, cell in enumerate(row) if cell == 2] for i, row in enumerate(lab) if 2 in row][0][0]
    # and take the goal position (used in the heuristic)
    (i_e, j_e) = [[(i, j) for j, cell in enumerate(row) if cell == 3] for i, row in enumerate(lab) if 3 in row][0][0]
    width = len(lab[0])
    height = len(lab)
    heuristic = lambda i, j: abs(i_e - i) + abs(j_e - j)
    comp = lambda state: state[2] + state[3] # get the total cost
    # small variation for easier code, state is (coord_tuple, previous, path_cost, heuristic_cost)
    fringe = [((i_s, j_s), list(), 0, heuristic(i_s, j_s))]
    visited = {} # maps position -> path cost at which it was expanded
    # stop when the fringe is empty, i.e. the goal is unreachable
    while fringe:
        # get first state (least cost)
        state = fringe.pop(0)
        # goal check
        (i, j) = state[0]
        if lab[i][j] == 3:
            path = [state[0]] + state[1]
            path.reverse()
            return path
        # set the cost (path is enough since the heuristic won't change)
        visited[(i, j)] = state[2]
        # explore neighbor
        neighbor = list()
        if i > 0 and lab[i-1][j] > 0: #top
            neighbor.append((i-1, j))
        if i < height - 1 and lab[i+1][j] > 0: #bottom
            neighbor.append((i+1, j))
        if j > 0 and lab[i][j-1] > 0: #left
            neighbor.append((i, j-1))
        if j < width - 1 and lab[i][j+1] > 0: #right
            neighbor.append((i, j+1))
        for n in neighbor:
            next_cost = state[2] + 1
            # skip positions already expanded at an equal or lower cost
            if n in visited and visited[n] <= next_cost:
                continue
            fringe.append((n, [state[0]] + state[1], next_cost, heuristic(n[0], n[1])))
        # resort the list (SHOULD use a priority queue here to avoid re-sorting all the time)
        fringe.sort(key=comp)
    return None # no path found
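The comment about using a priority queue can be followed with the standard library's heapq module. The sketch below is one possible variant, not the answer's own code: `astar_heap` and `mark_path` are hypothetical names, and the grid encoding (0 wall, 2 start, 3 goal) is taken from the question. `mark_path` covers the update in the question by writing X over the interior of the path.

```python
import heapq


def astar_heap(lab):
    """A* on a grid where 0 is a wall, 2 the start and 3 the goal."""
    cells = {(i, j): v for i, row in enumerate(lab) for j, v in enumerate(row)}
    start = next(p for p, v in cells.items() if v == 2)
    goal = next(p for p, v in cells.items() if v == 3)

    def heuristic(p):
        # Manhattan distance to the goal
        return abs(goal[0] - p[0]) + abs(goal[1] - p[1])

    best = {start: 0}         # cheapest known path cost per position
    previous = {start: None}  # back-pointers used to rebuild the path
    heap = [(heuristic(start), 0, start)]  # (total estimate, path cost, position)
    while heap:
        _, cost, pos = heapq.heappop(heap)
        if pos == goal:
            path = []
            while pos is not None:
                path.append(pos)
                pos = previous[pos]
            return path[::-1]
        if cost > best[pos]:
            continue  # stale entry: a cheaper route to pos was found later
        i, j = pos
        for n in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if cells.get(n, 0) == 0:
                continue  # outside the grid or a wall
            if n not in best or cost + 1 < best[n]:
                best[n] = cost + 1
                previous[n] = pos
                heapq.heappush(heap, (cost + 1 + heuristic(n), cost + 1, n))
    return None  # goal unreachable


def mark_path(lab, path, mark='X'):
    """Return a copy of lab with the interior of path replaced by mark."""
    marked = [row[:] for row in lab]
    for i, j in path[1:-1]:
        marked[i][j] = mark
    return marked


labyrinth = [
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 3, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 2, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
]
path = astar_heap(labyrinth)
for row in mark_path(labyrinth, path):
    print(row)
```

Each push and pop costs O(log n), instead of re-sorting the whole fringe on every iteration.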
I'm designing a maze generator in Python and have various functions for the different steps of the process. (I know the code can most definitely be improved, but I'm looking for an answer to my problem before I work on optimizing it.)
The first function generates a base maze in the form of a 2D list and works as expected:
def base_maze(dimension):
    num_rows = int((2 * dimension[1]) + 1) #number of rows / columns
    num_columns = int((2 * dimension[0]) + 1) #from tuple input
    zero_row = [] #initialise a row of 0s
    for i in range(num_columns):
        zero_row.append(0)
    norm_row = [] #initialise a row of
    for i in range(num_columns // 2): #alternating 0s and 1s
        norm_row.extend([0,1])
    norm_row.append(0)
    maze = [] #initialise maze
    #(combination of zero rows
    for i in range(num_rows // 2): # and normal rows)
        maze.append(zero_row)
        maze.append(norm_row)
    maze.append(zero_row)
    return maze
Another function gets the neighbors of the selected cell, and also works as expected:
def get_neighbours(cell, dimension):
    y = cell[0] #set x/y values
    max_y = dimension[0] - 1 #for reference
    x = cell[1]
    max_x = dimension[1] - 1
    n = (x, y-1) #calculate adjacent
    e = (x+1, y) #coordinates
    s = (x, y+1)
    w = (x-1, y)
    if y > max_y or y < 0 or x > max_x or x < 0: #check if x/y
        raise IndexError("Cell is out of maze bounds") #in bounds
    neighbours = []
    if y > 0: #add cells to list
        neighbours.append(n) #if they're valid
    if x < max_x: #cells inside maze
        neighbours.append(e)
    if y < max_y:
        neighbours.append(s)
    if x > 0:
        neighbours.append(w)
    return neighbours
The next function removes the wall between two given cells:
def remove_wall(maze, cellA, cellB):
    dimension = []
    x_dim = int(((len(maze[0]) - 1) / 2)) #calc the dimensions
    y_dim = int(((len(maze) - 1) / 2)) #of maze matrix (x,y)
    dimension.append(x_dim)
    dimension.append(y_dim)
    A_loc = maze[2*cellA[1]-1][2*cellA[0]-1]
    B_loc = maze[2*cellB[1]-1][2*cellB[0]-1]
    if cellB in get_neighbours(cellA, dimension): #if cell B is a neighbour
        if cellA[0] == cellB[0] and cellA[1] < cellB[1]: #if the x pos of A is equal
            adj_wall = maze[(2*cellA[0]+1)][2*cellA[1]+1+1] = 1 #to x pos of cell B and the y pos
            #of A is less than B (A is below B)
        elif cellA[0] == cellB[0] and cellA[1] > cellB[1]: #the adjacent wall is set to 1 (removed)
            adj_wall = maze[(2*cellA[0]+1)][2*cellA[1]+1-1] = 1
        #same is done for all other directions
        if cellA[1] == cellB[1] and cellA[0] < cellB[0]:
            adj_wall = maze[(2*cellA[0]+1)+1][(2*cellA[1]+1)] = 1
        elif cellA[1] == cellB[1] and cellA[0] > cellB[0]:
            adj_wall = maze[(2*cellA[0]+1-1)][(2*cellA[1]+1)] = 1
    return maze
Yet when I put these functions together into one final function to build the maze, they do not behave as they do on their own. For example:
def test():
    maze1 = base_maze([3,3])
    maze2 = [[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0]]
    if maze1 == maze2:
        print("they are exactly the same")
    else:
        print("WHY ARE THEY DIFFERENT???")
    remove_wall(maze1,(0,0),(0,1))
    remove_wall(maze2,(0,0),(0,1))
These produce different results even though the inputs are exactly the same:
test()
they are exactly the same
[[0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0]]
[[0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0]]
The problem is in your base_maze function, where you first create two types of row:
    zero_row = [] #initialise a row of 0s
    for i in range(num_columns):
        zero_row.append(0)
    norm_row = [] #initialise a row of
    for i in range(num_columns // 2): #alternating 0s and 1s
        norm_row.extend([0,1])
    norm_row.append(0)
This is fine so far and works as expected. However, when you build the maze from there
    for i in range(num_rows // 2): # and normal rows)
        maze.append(zero_row)
        maze.append(norm_row)
    maze.append(zero_row)
you are filling the maze list with multiple references to the same two list objects. This means that if you modify row 0 of the maze, rows 2 & 4 will also be affected. To illustrate:
>>> def print_maze(maze):
...     print('\n'.join(' '.join(str(x) for x in row) for row in maze))
...
>>> print_maze(maze)
0 0 0 0 0
0 1 0 1 0
0 0 0 0 0
0 1 0 1 0
0 0 0 0 0
>>> maze[0][0] = 3
>>> print_maze(maze)
3 0 0 0 0
0 1 0 1 0
3 0 0 0 0
0 1 0 1 0
3 0 0 0 0
Note that rows 0, 2, & 4 have all changed. This is because maze[0] is the very same zero_row object as maze[2] and maze[4], not a separate list with equal contents.
Instead, when you create the maze you want to append a copy of each row list. This can be done easily in Python using the following slicing notation:
    for i in range(num_rows // 2):
        maze.append(zero_row[:]) # note the [:] syntax for copying a list
        maze.append(norm_row[:])
    maze.append(zero_row[:])
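Another way to get independent rows, sketched here in Python 3 as an alternative to slicing, is to build every row inside a list comprehension so that no list object is ever reused. The function name `base_maze_independent` is made up for this illustration, and it assumes the same (x, y) dimension convention as base_maze.

```python
def base_maze_independent(dimension):
    """Build the same grid as base_maze, but with a fresh list per row."""
    num_rows = 2 * dimension[1] + 1
    num_columns = 2 * dimension[0] + 1
    # A cell is open (1) only where both the row and column index are odd.
    return [[1 if (r % 2 == 1 and c % 2 == 1) else 0
             for c in range(num_columns)]
            for r in range(num_rows)]


maze = base_maze_independent((2, 2))
print(maze[0] is maze[2])  # False: the rows are distinct objects
maze[0][0] = 3
print(maze[2][0])          # 0: row 2 is unaffected
```

The `is` operator is a quick way to check for this kind of aliasing, since `==` only compares contents.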
Given a maze as an array of arrays where 1 is a wall and 0 is a passable area:
The distance must include the start node; a plain BFS on this maze gives 21.
[0][0] is the start point.
|
[ V
[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0]<-- [-1][-1] is the end point.
]
We must find the shortest possible path, and we may remove one '1' to create a shortcut.
The shortcut that creates the shortest path is changing [1][0] to 0, which opens a path of distance 11.
[
[0, 0, 0, 0, 0, 0],
-->[0, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0]
]
return 11
My original thought was to run through every element, check whether it is 1, and if so run a BFS with that wall removed and compare the distance with the minimum so far.
But of course that was too slow. So I thought of running through every element, checking whether it is 1, and then checking whether it has exactly two passable neighbors, because that seems to be the only case where a shortcut is meaningful.
Here is my code:
import copy
def bfs(maze):
    visited = set()
    queue = []
    mazeHeight = len(maze)
    mazeWidth = len(maze[0])
    queue.append(((0,0),1))
    while queue:
        yx,distance = queue.pop(0)
        y,x = yx
        visited.add(yx)
        if yx == (mazeHeight-1,mazeWidth-1):
            return distance
        if y+1 < mazeHeight:
            if not maze[y+1][x] and (y+1,x) not in visited:
                queue.append(((y+1,x),distance+1))
        if y-1 >= 0:
            if not maze[y-1][x] and (y-1,x) not in visited:
                queue.append(((y-1,x),distance+1))
        if x+1 < mazeWidth:
            if not maze[y][x+1] and (y,x+1) not in visited:
                queue.append(((y,x+1),distance+1))
        if x-1 >= 0:
            if not maze[y][x-1] and (y,x-1) not in visited:
                queue.append(((y,x-1),distance+1))
    return False
def answer(maze):
    min = bfs(maze)
    mazeHeight = len(maze)
    mazeWidth = len(maze[0])
    for y in range(mazeHeight):
        for x in range(mazeWidth):
            if maze[y][x]:
                oneNeighbors = 0
                if y+1 < mazeHeight:
                    if not maze[y+1][x]:
                        oneNeighbors += 1
                if y-1 >= 0:
                    if not maze[y-1][x]:
                        oneNeighbors += 1
                if x+1 < mazeWidth:
                    if not maze[y][x+1]:
                        oneNeighbors += 1
                if x-1 >= 0:
                    if not maze[y][x-1]:
                        oneNeighbors += 1
                if oneNeighbors == 2:
                    tmpMaze = copy.deepcopy(maze)
                    tmpMaze[y][x] = 0
                    tmpMin = bfs(tmpMaze)
                    if tmpMin < min:
                        min = tmpMin
    return min
print(answer([[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0], [0, 0, 0, 0, 0, 0], [0, 1, 1, 1, 1, 1], [0, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]]))
Any suggestions to improve the speed?
You seem to be on the right track. The following approach can be considered:
Form a graph of n x m nodes, where n and m are the dimensions of the maze matrix.
There is an edge of cost zero between two nodes if they are adjacent 0s. There is an edge of cost one between two nodes if they are both 0s separated by a single 1.
(Note that you need to maintain two costs for each path: the zero-one cost above, and the number of nodes in the path, which is what you minimise.)
Then perform BFS and consider only paths that have a zero-one cost <= 1.
This gives you a linear-time algorithm (linear in the number of nodes).
The following code may contain bugs, but it should get you started.
def bfs(maze):
    visited = set()
    queue = []
    mazeHeight = len(maze)
    mazeWidth = len(maze[0])
    # each entry is (position, distance in nodes, walls removed so far)
    queue.append(((0,0),1,0))
    while queue:
        yx,distance, cost = queue.pop(0)
        y,x = yx
        visited.add(yx)
        if yx == (mazeHeight-1,mazeWidth-1):
            return distance
        # zero-cost edges: step onto an adjacent 0
        if y+1 < mazeHeight:
            if not maze[y+1][x] and (y+1,x) not in visited:
                queue.append(((y+1,x),distance+1, cost))
        if y-1 >= 0:
            if not maze[y-1][x] and (y-1,x) not in visited:
                queue.append(((y-1,x),distance+1, cost))
        if x+1 < mazeWidth:
            if not maze[y][x+1] and (y,x+1) not in visited:
                queue.append(((y,x+1),distance+1, cost))
        if x-1 >= 0:
            if not maze[y][x-1] and (y,x-1) not in visited:
                queue.append(((y,x-1),distance+1, cost))
        # cost-one edges: jump over a single 1 onto a 0 two cells away
        # (the bounds checks must cover the cell two steps away)
        if cost == 0:
            if y+2 < mazeHeight:
                if not maze[y+2][x] and (y+2,x) not in visited and maze[y+1][x] == 1:
                    queue.append(((y+2,x),distance+2, cost+1))
            if y-2 >= 0:
                if not maze[y-2][x] and (y-2,x) not in visited and maze[y-1][x] == 1:
                    queue.append(((y-2,x),distance+2, cost+1))
            if x+2 < mazeWidth:
                if not maze[y][x+2] and (y,x+2) not in visited and maze[y][x+1] == 1:
                    queue.append(((y,x+2),distance+2, cost+1))
            if x-2 >= 0:
                if not maze[y][x-2] and (y,x-2) not in visited and maze[y][x-1] == 1:
                    queue.append(((y,x-2),distance+2, cost+1))
    return False
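A related formulation, sketched here as an alternative rather than as the answer's method, keeps every edge at length one by making the number of removed walls part of the BFS state: stepping onto a 1 is allowed only while no wall has been removed yet. The function name `shortest_with_one_removal` is hypothetical, and the sketch assumes the start and end cells are passable.

```python
from collections import deque


def shortest_with_one_removal(maze):
    """Shortest start-to-end distance in nodes, removing at most one wall."""
    height, width = len(maze), len(maze[0])
    goal = (height - 1, width - 1)
    # A state is (y, x, removed), where removed counts walls crossed (0 or 1).
    seen = {(0, 0, 0)}
    queue = deque([(0, 0, 0, 1)])  # y, x, removed, distance (start counts as 1)
    while queue:
        y, x, removed, dist = queue.popleft()
        if (y, x) == goal:
            return dist
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < height and 0 <= nx < width):
                continue
            nremoved = removed + maze[ny][nx]  # stepping onto a 1 uses the removal
            if nremoved > 1 or (ny, nx, nremoved) in seen:
                continue
            seen.add((ny, nx, nremoved))
            queue.append((ny, nx, nremoved, dist + 1))
    return None  # unreachable even with one removal


maze = [[0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 1, 1],
        [0, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0]]
print(shortest_with_one_removal(maze))  # 11
```

Because each cell appears in at most two states, the search still visits a number of states linear in the size of the maze, and a single BFS replaces one BFS per candidate wall.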
I have a problem where I want to identify and remove columns in a logical matrix that are subsets of other columns, i.e. [1, 0, 1] is a subset of [1, 1, 1], but neither of [1, 1, 0] and [0, 1, 1] is a subset of the other. I wrote a quick piece of code that identifies the columns that are subsets; it does (n^2-n)/2 checks using a pair of nested for loops.
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
columns = [True]*cols
for i in range(cols):
    for j in range(i+1,cols):
        diff = A[:,i]-A[:,j]
        if all(diff >= 0):
            print "%d is a subset of %d" % (j, i)
            columns[j] = False
        elif all(diff <= 0):
            print "%d is a subset of %d" % (i, j)
            columns[i] = False
B = A[:,columns]
The solution should be
>>> print B
[[1 0 0]
[0 1 1]
[1 1 0]
[1 0 1]
[1 0 1]
[1 0 0]
[0 1 1]
[0 1 0]]
For massive matrices, though, I'm sure there's a way to do this faster. One thought is to eliminate subset columns as I go, so that I'm not checking columns already known to be subsets. Another thought is to vectorize this so that I don't have O(n^2) operations. Thank you.
Since the A matrices I'm actually dealing with are 5000x5000 and sparse, with about 4% density, I decided to try a sparse matrix approach combined with Python's "set" objects. Overall it's much faster than my original solution, but I feel like my process of going from matrix A to the list of sets D is not as fast as it could be. Any ideas on how to do this better are appreciated.
Solution
import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
drops = np.zeros(cols).astype(bool)
# sparse nonzero elements
C = np.nonzero(A)
# create a list of sets containing the indices of non-zero elements of each column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
    D[C[1][i]].add(C[0][i])
# find subsets, ignoring columns that are known to already be subsets
for i in range(cols):
    if drops[i]==True:
        continue
    col1 = D[i]
    for j in range(i+1,cols):
        col2 = D[j]
        if col2.issubset(col1):
            # I tried `if drops[j]==True: continue` here, but that was slower
            print "%d is a subset of %d" % (j, i)
            drops[j] = True
        elif col1.issubset(col2):
            print "%d is a subset of %d" % (i, j)
            drops[i] = True
            break
B = A[:, ~drops]
print B
Here's another approach using NumPy broadcasting -
sup = (A[:,:,None] >= A[:,None,:]).all(0)
A[:,~((sup & ~sup.T) | np.triu(sup & sup.T,1)).any(0)]
A detailed commented explanation is listed below -
# Compare every pair of columns elementwise, keeping the alignment along the rows.
# sup[i,j] is True when column i covers column j, i.e. column j is a subset of
# column i. This gives a 2D mask relating all columns against each other.
sup = (A[:,:,None] >= A[:,None,:]).all(0)
# Strict subsets: column j is a subset of column i, but not the other way round
strict = sup & ~sup.T
# Duplicates: identical columns are subsets of each other. The upper triangle
# sans the diagonal elements flags only the later copy, so the first one is kept.
dup = np.triu(sup & sup.T,1)
# Finally get the valid column mask by dropping every column that is a strict
# subset or a later duplicate of some other column.
# Index into input array with it for the final output.
colmask = ~((strict | dup).any(0))
out = A[:,colmask]
Define subset: col1 is a subset of col2 if and only if col1.dot(col1) == col1.dot(col2).
Define equality: col1 and col2 are the same if and only if col1 is a subset of col2 and vice versa.
I split the work into two steps. First get rid of all but one of each group of equivalent columns. Then remove subsets.
Solution
import numpy as np
def drop_duplicates(A):
    N = A.T.dot(A)
    D = np.diag(N)[:, None]
    drops = np.tril((N == D) & (N == D.T), -1).any(axis=1)
    return A[:, ~drops], drops
def drop_subsets(A):
    N = A.T.dot(A)
    drops = ((N == np.diag(N)).sum(axis=0) > 1)
    return A[:, ~drops], drops
def drop_strict(A):
    A1, d1 = drop_duplicates(A)
    A2, d2 = drop_subsets(A1)
    d1[~d1] = d2
    return A2, d1
A = np.array([[1, 0, 0, 0, 0, 1],
[0, 1, 1, 1, 1, 0],
[1, 0, 1, 0, 1, 1],
[1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 0, 1, 0, 1, 0]])
B, drops = drop_strict(A)
Demonstration
print B
print
print drops
[[1 0 0]
[0 1 1]
[1 1 0]
[1 0 1]
[1 0 1]
[1 0 0]
[0 1 1]
[0 1 0]]
[False True False False True True]
Explanation
N = A.T.dot(A) is the matrix of dot products between every pair of columns. Per the definition of subset at the top, this will come in handy.
def drop_duplicates(A):
    N = A.T.dot(A)
    D = np.diag(N)[:, None]
    # (N == D)[i, j] being True identifies A[:, i] as a subset
    # of A[:, j]. (N == D.T)[i, j] being True is the reverse relationship.
    # If A[:, j] is subset of A[:, i] and vice versa, then we have
    # equivalent columns. Taking the lower triangle ensures we
    # leave one.
    drops = np.tril((N == D) & (N == D.T), -1).any(axis=1)
    return A[:, ~drops], drops
def drop_subsets(A):
    N = A.T.dot(A)
    # without concern for removing equivalent columns, this
    # removes any column that has an off diagonal equal to the diagonal
    drops = ((N == np.diag(N)).sum(axis=0) > 1)
    return A[:, ~drops], drops
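To see the dot-product criterion on its own, here is a small sketch on a made-up 3x3 matrix. The helper name `subset_matrix` is hypothetical, and the criterion assumes A contains only 0s and 1s.

```python
import numpy as np


def subset_matrix(A):
    """S[i, j] is True when column i of the 0/1 matrix A is a subset of column j."""
    A = np.asarray(A, dtype=np.int64)
    N = A.T.dot(A)      # N[i, j] = number of rows where columns i and j are both 1
    sizes = np.diag(N)  # sizes[i] = number of 1s in column i
    # Column i is a subset of column j when all of its 1s are shared with column j.
    return N == sizes[:, None]


A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 1, 0]])
S = subset_matrix(A)
print(S[0, 1], S[1, 0], S[2, 1])  # True False True
```

Here column 0 is [1, 0, 1] and column 2 is [0, 1, 0]; both are subsets of column 1, which is [1, 1, 1], but not the other way round.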