I have an image, and I want to generate a weighted graph G=(V,E), where V is the vertex set and E is the edge set (each pixel in the image is a node of the graph).
But I don't know how to do it.
Is there anyone who can help me? Preferably in Python.
Thank you very much.
Problem supplement
I'm sorry that my description of the problem was not clear enough.
My goal is to use the pixels of the image as the nodes of a network, and then analyse the properties of that network in order to (possibly) detect a target.
But as a first step, I need to build this network. My question is how to use the pixels of an (RGB) image as the nodes of a network for analysing the image.
The edges between these nodes may be based on some of their characteristics (location, appearance, etc.).
So I just want to know how to build this network.
Just some simple examples, please. Thank you.
I was looking for nicely vectorised answers too and didn't find any, so in the end I did it myself. My intention is also to make these calculations as fast as possible.
Let's start with this nice 28 x 27 image:
import numpy as np
x, y = np.meshgrid(np.linspace(-np.pi/2, np.pi/2, 30), np.linspace(-np.pi/2, np.pi/2, 30))
image = (np.sin(x**2+y**2)[1:-1,1:-2] > 0.9).astype(int) #binary image
image
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0]
[0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0]
[0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0]
[0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0]
[0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0]
[0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0]
[0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0]
[0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
[0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
[0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]
[0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]
[0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]
[0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]
[0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
[0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
[0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0]
[0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0]
[0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0]
[0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0]
[0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0]
[0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0]
[0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Networkx
The rationale of the algorithm is to identify the coordinates of set pixels that have companions to the right and below. Nodes of a networkx graph can be any hashable objects, so we can label them with tuples. This is quite easy to implement, although not efficient, because it requires converting items of the np.array into tuples:
#CONSTRUCTION OF HORIZONTAL EDGES
hx, hy = np.where(image[1:] & image[:-1]) #horizontal edge start positions
h_units = np.array([hx, hy]).T
h_starts = [tuple(n) for n in h_units]
h_ends = [tuple(n) for n in h_units + (1, 0)] #end positions = start positions shifted by vector (1,0)
horizontal_edges = zip(h_starts, h_ends)
#CONSTRUCTION OF VERTICAL EDGES
vx, vy = np.where(image[:,1:] & image[:,:-1]) #vertical edge start positions
v_units = np.array([vx, vy]).T
v_starts = [tuple(n) for n in v_units]
v_ends = [tuple(n) for n in v_units + (0, 1)] #end positions = start positions shifted by vector (0,1)
vertical_edges = zip(v_starts, v_ends)
And let's see how it looks:
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from(horizontal_edges)
G.add_edges_from(vertical_edges)
pos = dict(zip(G.nodes(), G.nodes())) # map node names to coordinates
nx.draw_networkx(G, pos, with_labels=False, node_size=0)
labels={node: f'({node[0]},{node[1]})' for node in G.nodes()}
nx.draw_networkx_labels(G, pos, labels, font_size=6, font_family='serif', font_weight='bold', bbox = dict(fc='lightblue', ec="black", boxstyle="round", lw=1))
plt.show()
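The original question asks for a weighted graph, so here is a minimal sketch of one way to extend the above (my own addition, assuming a 2D grayscale array img): weight every 4-neighbour edge by the absolute intensity difference of its two pixels; for RGB you could use a colour distance instead.
import numpy as np
import networkx as nx

img = np.random.rand(5, 5)   # hypothetical grayscale image
G_w = nx.Graph()
rows, cols = img.shape
for r in range(rows):
    for c in range(cols):
        if r + 1 < rows:     # edge to the pixel below, weighted by intensity difference
            G_w.add_edge((r, c), (r + 1, c), weight=abs(img[r, c] - img[r + 1, c]))
        if c + 1 < cols:     # edge to the pixel on the right
            G_w.add_edge((r, c), (r, c + 1), weight=abs(img[r, c] - img[r, c + 1]))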
igraph
Networkx is built purely in Python and performs slowly on big data (like images with millions of pixels). Igraph, on the other hand, is built in C, but it is less well supported: the documentation is not as detailed, and its internal visualisation tools are used instead of matplotlib. So igraph might be the more complicated option, but if you take it, the performance win is gigantic. There are some must-know facts before implementing the algorithm:
Node indices should be integers starting from 0. This means that if you pass something else to igraph.add_vertices(), it will be reindexed as 0, 1, 2, ..., and all the old names of the indices are kept in igraph.vs['name'].
Edges that refer to nonexistent vertex indices (anything other than 0, 1, 2, ...) are not allowed in igraph.add_edges().
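A tiny sketch of this behaviour with a toy graph (my own illustration, not part of the algorithm below):
import igraph as ig

g = ig.Graph()
g.add_vertices(["a", "b", "c"])   # names are kept in g.vs['name'] ...
print(g.vs["name"])               # ['a', 'b', 'c']
g.add_edges([(0, 1), (1, 2)])     # ... but edges refer to the integer indices 0, 1, 2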
Taking these requirements into consideration, it's a good option to reduce the dimensionality of the image, i.e. rename the pixels to integers 0, 1, 2, ... Now here we go:
import igraph as ig

def create_from_edges(edgearray):
    #This function imitates the behaviour of nx.Graph.add_edges_from for an empty graph
    g = ig.Graph()
    u, inv = np.unique(edgearray, return_inverse=True)
    e = inv.reshape(edgearray.shape)
    g.add_vertices(u) #add vertices, in any order
    g.add_edges(e) #add edges, in reindexed order
    return g #old indices are kept in g.vs['name']
#Create an array of edges with image pixels enumerated from 0 to N-1
image_idx = np.arange(image.size).reshape(*image.shape) #pixels of image indexed with numbers 0 to N-1
X, Y = (units.reshape(image.size) for units in np.indices(image.shape)) #X and Y coordinates of image_idx
idx = np.array([X, Y]).T #layout of nodes
hx, hy = np.where(image[1:] & image[:-1]) #horizontal edges as 2D indices
h_starts_idx = image_idx[hx, hy] #image_idx where horizontal edge starts
h_ends_idx = image_idx[hx+1, hy] #image_idx where horizontal edge ends
vx, vy = np.where(image[:, 1:] & image[:, :-1]) #vertical edges as 2D indices
v_starts_idx = image_idx[vx, vy] #image_idx where vertical edge starts
v_ends_idx = image_idx[vx, vy+1] #image_idx where vertical edge ends
edgearray = np.vstack([np.array([h_starts_idx, h_ends_idx]).T,
np.array([v_starts_idx, v_ends_idx]).T])
g = create_from_edges(edgearray)
And here's my sketch that illustrates the new order of vertex names:
ig.plot(g, bbox=(450, 450),
layout = ig.Layout(idx[g.vs['name']].tolist()), #only lists can be passed in to layout
vertex_color = 'lightblue', vertex_label = g.vs['name'], vertex_size=14, vertex_label_size=8)
requirements: python-igraph, pycairo (for plotting).
I have a dataframe like this
a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0
and the original corresponding string
(a:0,b:0,c:0,d:0,e:0,f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0
The algorithm I thought of is like this.
In row mut1, we can see that f,g,h,i,j,k,l,m have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,(f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0):0
In row mut2, we can see that f,g,h,i,j have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,((f:0,g:0,h:0,i:0,j:0):0,k:0,l:0,m:0):0):0
Until mut10, it continues to cluster samples in f,g,h,i,j,k,l,m.
And the output will be
(a:0,b:0,c:0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
(For a row with only one "1", just skip the process.)
From mut11, it starts to cluster samples a,b,c,d,e
and similarly, the final output will be
(((a:0,b:0):0,c:0):0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
So the algorithm is
Cluster the samples with the same features.
After clustering, add ":0" behind the closing parenthesis.
Any suggestions on this process?
P.S. I have posted a similar question, Creating a newick format from dataframe with 0 and 1, but this one is more detailed.
Your question asks for a solution in Python, which I'm not familiar with. Hopefully, the following procedure in R will be helpful as well.
What your question describes is matrix representation of a tree. Such a tree can be retrieved from the matrix with a maximum parsimony method using the phangorn package. To manipulate trees in R, newick format is useful. Newick differs from the tree representation in your question by ending with a semicolon.
First, prepare a starting tree in phylo format.
library(phangorn)
tree0 <- read.tree(text = "(a,b,c,d,e,f,g,h,i,j,k,l,m);")
Second, convert your data.frame to a phyDat object, where the rows represent samples and the columns represent features. The phyDat object also requires specifying the levels present in the data, which are 0 and 1 in this case. Combining the starting tree with the data, we calculate the maximum parsimony tree.
dat0 = read.table(text = " a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0")
dat1 <- phyDat(data = t(dat0),
type = "USER",
levels = c(0, 1))
tree1 <- optim.parsimony(tree = tree0, data = dat1)
plot(tree1)
The tree now contains a cladogram with no branch lengths. Class phylo is effectively a list, so the zero branch lengths can be added as an extra element.
tree2 <- tree1
tree2$edge.length <- rep(0, nrow(tree2$edge))
Last, we write the tree into a character vector in newick format and remove the semicolon at the end to match the requirement.
tree3 <- write.tree(tree2)
tree3 <- sub(";", "", tree3)
tree3
# [1] "((e:0,d:0):0,(c:0,(b:0,a:0):0):0,((m:0,(l:0,k:0):0):0,((i:0,h:0):0,j:0,(g:0,f:0):0):0):0)"
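If you do want to stay in Python, here is a rough sketch of an alternative (my own, not a translation of the R code above): SciPy's hierarchical clustering on the sample columns, written out with ":0" branch lengths. It always produces a fully binary tree, so the topology will generally not match the parsimony tree exactly.
import pandas as pd
from scipy.cluster.hierarchy import linkage, to_tree

cols = list("abcdefghijklm")
rows = ["0000011111111", "0000011111000", "0000011000000", "0000010000000",
        "0000000110000", "0000000100000", "0000000001000", "0000000000111",
        "0000000000110", "0000000000001", "1111100000000", "1110000000000",
        "1100000000000", "1000000000000", "0001000000000", "0000100000000"]
df = pd.DataFrame([[int(c) for c in r] for r in rows], columns=cols,
                  index=[f"mut{i+1}" for i in range(len(rows))])

# cluster the samples (columns) by their 0/1 feature profiles
Z = linkage(df.T.values, method="single", metric="hamming")
root = to_tree(Z)

def newick(node):
    # recursively build a Newick-like string with ":0" branch lengths
    if node.is_leaf():
        return f"{cols[node.id]}:0"
    return f"({newick(node.left)},{newick(node.right)}):0"

print(newick(root))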
The documentation seems to be bare bones, and the example given in the standard TF tutorial doesn't highlight a behavior I see. Let's say you have an imbalanced dataset of 1s and 0s (pos and neg), and you want to sample at weights [0.5, 0.5], so that you see the positives more frequently. You would do this:
import numpy as np
import tensorflow as tf

pos_ds = tf.data.Dataset.from_tensor_slices(np.ones(shape=(16, 1)))
neg_ds = tf.data.Dataset.from_tensor_slices(np.zeros(shape=(128, 1)))
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
And if I want to see how the pos and neg are distributed as I go through the dataset:
xs = []
for x in resampled_ds:
xs.append(int(x.numpy()[0]))
xs = np.array(xs)
print(xs)
np.bincount(xs)
I see this:
[1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1
0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
array([128, 16])
There are 128 negatives and 16 positives. If I use this as my train_ds, it will be equivalent to doing no sampling at all, and worse, the negatives are no longer uniformly distributed across the steps / epoch. I am guessing that the 0.5 sampling happens at the beginning, and once it "runs out" of 1s, it just keeps sampling the zeros. It clearly doesn't do sampling with replacement for the 1s. I think the 1s and 0s will only be 0.5/0.5 if you stop after all the 1s have been sampled.
It looks like this is the behavior, but it isn't the only sensible one. I want to sample the positives multiple times (i.e. sampling with replacement) within one epoch, with approximately equal amounts of pos and neg. Is there any option for this in the API? Also, I have data augmentation, so the positives are actually not identical when trained.
You can do something like this for the replacement issue:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds.repeat(128 // 16), neg_ds], weights=[0.5, 0.5])
And the result is:
[1 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1
1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 1 1
1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0
0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Out[2]: array([128, 128], dtype=int64)
Actually, I also found that the solution is right there in the TF tutorial imbalanced_data.ipynb (I totally missed it in my own notebook).
pos_ds = pos_ds.shuffle(BUFFER_SIZE).repeat()
neg_ds = neg_ds.shuffle(BUFFER_SIZE).repeat()
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
The tutorial further suggests a heuristic to set resampled_steps_per_epoch.
However, shuffle + repeat is still not equivalent to true sampling with replacement for the minority class. A repeat() followed by a shuffle() might do it. I may update this after trying both ways.
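To make that heuristic concrete, a small sketch (the batch size and counts are illustrative, and the formula is just one reasonable choice in the spirit of the tutorial's heuristic): with both datasets repeated indefinitely the resampled dataset is infinite, so define an epoch as enough batches to see each negative example roughly once on average.
import numpy as np

BATCH_SIZE = 32
neg_count = 128   # size of the majority (negative) dataset
# each batch is ~50% negatives, so this many steps covers the negatives about once
resampled_steps_per_epoch = int(np.ceil(2.0 * neg_count / BATCH_SIZE))

# model.fit(resampled_ds.batch(BATCH_SIZE),
#           steps_per_epoch=resampled_steps_per_epoch, ...)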
I have a data set that contains ten columns and 3000 rows. Each of the columns contains a 0 or 1. The ten columns concatenated together represent a label. There are ten labels, from 0 to 9. A concatenated sequence like "1000000000" represents the label zero, "0100000000" represents label one, and "0000000001" represents label nine.
What is the best/most efficient way to convert these sequences into labels and add them as the eleventh column of the data set?
for loop
lambda function
masking
binary and operation
I am confused; currently I am trying to write a lambda function to do this, which is getting me nowhere:
target1 = target.apply(lambda x: [print(x) for j in range(10) for i in x], axis = 1)
I would like to know which method I should use to implement this pattern matching.
Initial Data frame
import pandas as pd

data = [[1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,0,0,0,0,0,0],
[0,0,1,0,0,0,0,0,0,0],
[0,0,0,1,0,0,0,0,0,0],
[0,0,0,0,1,0,0,0,0,0],
[0,0,0,0,0,1,0,0,0,0],
[0,0,0,0,0,0,1,0,0,0],
[0,0,0,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,0,1,0],
[0,0,0,0,0,0,0,0,0,1]]
df = pd.DataFrame(data)
Final data with the eleventh column named label:
[dataframe] [label]
1000000000 0
0100000000 1
0010000000 2
0001000000 3
0000100000 4
0000010000 5
0000001000 6
0000000100 7
0000000010 8
0000000001 9
You are effectively looking for the column index with the maximum value, so you can use DataFrame.idxmax() with axis=1 to apply it to the values across each row:
df['label'] = df.idxmax(axis=1)
Note that if you have additional columns beyond the 10 numeric ones, you'd want to first select only those 10 numeric columns; e.g. df.iloc[:, range(10)].idxmax(...).
Demo:
>>> import pandas as pd
>>> data = [[1,0,0,0,0,0,0,0,0,0],
... [0,1,0,0,0,0,0,0,0,0],
... [0,0,1,0,0,0,0,0,0,0],
... [0,0,0,1,0,0,0,0,0,0],
... [0,0,0,0,1,0,0,0,0,0],
... [0,0,0,0,0,1,0,0,0,0],
... [0,0,0,0,0,0,1,0,0,0],
... [0,0,0,0,0,0,0,1,0,0],
... [0,0,0,0,0,0,0,0,1,0],
... [0,0,0,0,0,0,0,0,0,1]]
>>> df = pd.DataFrame(data)
>>> df['label'] = df.idxmax(axis=1)
>>> df
0 1 2 3 4 5 6 7 8 9 label
0 1 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 1
2 0 0 1 0 0 0 0 0 0 0 2
3 0 0 0 1 0 0 0 0 0 0 3
4 0 0 0 0 1 0 0 0 0 0 4
5 0 0 0 0 0 1 0 0 0 0 5
6 0 0 0 0 0 0 1 0 0 0 6
7 0 0 0 0 0 0 0 1 0 0 7
8 0 0 0 0 0 0 0 0 1 0 8
9 0 0 0 0 0 0 0 0 0 1 9
I had advocated using Series.idxmax() via DataFrame.apply() at first, but in a now-deleted comment Jezrael reminded me that DataFrame.idxmax() also exists and is much more practical here.
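For comparison, a tiny sketch of that apply-based variant (slower; the 3-column frame here is just a stand-in for the 10-column one):
import pandas as pd

df = pd.DataFrame([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
df['label'] = df.apply(lambda row: row.idxmax(), axis=1)   # per-row Series.idxmax()
print(df)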
1. Let's generate a pandas DataFrame
import numpy as np
import pandas as pd
n = 10
#---let's generate a pandas DF
M = np.identity(n,dtype=int); M = np.vstack((M,M))
np.random.shuffle(M)
PD = pd.DataFrame(M)
print(PD)
#--- that's the label vector
vLabel = np.arange(n,dtype=int)
So we get:
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1
3 0 0 1 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 1 0 0 0 0
6 0 0 0 0 0 0 0 0 0 1
7 0 1 0 0 0 0 0 0 0 0
8 1 0 0 0 0 0 0 0 0 0
9 0 1 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 1 0
11 1 0 0 0 0 0 0 0 0 0
12 0 0 0 1 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 1 0
14 0 0 1 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 1 0 0
16 0 0 0 0 1 0 0 0 0 0
17 0 0 0 0 1 0 0 0 0 0
18 0 0 0 0 0 0 0 1 0 0
19 0 0 0 0 0 0 1 0 0 0
2. The labeling is a matrix-vector multiplication
#--- the labeling is a matrix-vector multiplication
Label = np.dot(PD,vLabel)
print(Label)
So we get:
[6 5 9 2 3 5 9 1 0 1 8 0 3 8 2 7 4 4 7 6]
3. Each row can be transformed into a string
#---- each row can be transformed into a string
for j in range(2*n):
    print(str(PD.values[j,:]))
So we get:
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 1 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0]
[1 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 1 0 0 0]
And from here you can continue :-)
Note: point 2 (the matrix multiplication) is efficient; point 3 (the for loop) is not, so you might want to improve this step.
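As one possible improvement of step 3 (a sketch, assuming PD from the code above): let pandas concatenate the digits of each row instead of printing in a Python loop.
#--- build "1000000000"-style strings, one per row, without an explicit print loop
label_strings = PD.astype(str).apply(''.join, axis=1)
print(label_strings.head())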
Suppose I have this:
import numpy as np
x = np.zeros((10,16), dtype=np.int)
x[6:8,3:11] = 1
x[4:6,5:7] = 1
x[2:4,4:8] = 1
x[4:6,9:11] = 1
x[7,2] = 1
x[6,11] = 1
x[8,3] = 1
print(x)
Output:
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0]
[0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0]
[0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0]
[0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
And I want to filter it so that elements with fewer than 2 neighbors in a 4-neighborhood (up, left, right, down) are removed. So I'd end up with this (the last three positions set to one above are removed):
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0]
[0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0]
[0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0]
[0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
I tried using scipy.ndimage.morphology.binary_closing, scipy.ndimage.morphology.binary_opening, scipy.ndimage.morphology.binary_dilation and scipy.ndimage.morphology.binary_erosion, but the result isn't what I need. I could make 2 for loops and iterate over each element of the array, checking for the neighbor elements, but I feel like there's a better way to do this. Am I mistaken?
I'm more interested in this specific situation (4 neighborhood, keep 2 neighbors), but is it easy to generalize to another neighborhood or number of neighbors (assuming a binary array)?
I managed to get it done like this:
from scipy.signal import convolve2d
kernel = [[0,1,0],[1,1,1],[0,1,0]]
filtered = convolve2d(x, kernel, mode='same')
x[filtered<=2] = 0
Filtered:
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0]
[0 0 0 1 3 4 4 3 1 0 0 0 0 0 0 0]
[0 0 0 1 3 5 5 3 1 1 1 0 0 0 0 0]
[0 0 0 0 2 4 4 2 1 3 3 1 0 0 0 0]
[0 0 0 1 2 4 4 2 2 4 4 2 0 0 0 0]
[0 0 2 3 4 5 5 4 4 5 5 2 1 0 0 0]
[0 1 2 5 4 4 4 4 4 4 3 2 0 0 0 0]
[0 0 2 2 2 1 1 1 1 1 1 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]]
And I got the output I wanted. Thank you @user3080953
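To address the generalization part of the question, here is a sketch (the function and names are mine, not from the answer above): put the neighborhood in the kernel with the centre set to 0, so the convolution counts only the neighbors, and make the threshold a parameter.
import numpy as np
from scipy.signal import convolve2d

def prune(binary, kernel, min_neighbors):
    # zero out pixels whose number of set neighbors (as defined by the kernel) is below the threshold
    neighbor_count = convolve2d(binary, kernel, mode='same')
    out = binary.copy()
    out[neighbor_count < min_neighbors] = 0
    return out

four = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])      # 4-neighborhood, centre excluded
eight = np.ones((3, 3), dtype=int)
eight[1, 1] = 0                   # 8-neighborhood, centre excluded

# prune(x, four, 2) should match the filtering above; prune(x, eight, 3) is an 8-neighborhood variant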
I have used joblib.dump to store a machine learning model (21 classes).
When I call the model and test it with a hold-out set, I get a value, but I do not know which metric it is (accuracy, precision, recall, etc.):
0.952380952381
So I computed the confusion matrix and the FP, FN, TN, TP values.
I used the information from this link (1).
I also found some code on GitHub (2).
I compared both results (1 and 2). Both give the same value for accuracy, 0.995464852608, but this result is different from the one above!
Any ideas? Did I compute TP, FP, TN, FN correctly?
MY CONFUSION MATRIX
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]]
MY CODE
#Testing with the holdout set
print(loaded_model.score(x_oos, y_oos))
0.952380952381 <------IS IT ACCURACY?
#Calculating the Confusion matrix
cm = confusion_matrix(y_oos, y_oos_pred)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
#Calculating values according to link 2.
FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = (21 - (FP + FN + TP)) #I put 21 because I have 21 classes
# Overall accuracy
ACC = np.mean((TP+TN)/(TP+FP+FN+TN))
print(ACC)
0.995464852608 <----IT IS DIFFERENT FROM THE ABOVE ONE.
Your example is a little bit confusing. If you provide some numbers it would be easier to understand and answer. For example just printing cm would be very helpful.
That being said, the way to deconstruct a sklearn.metrics.confusion_matrix is as follows (for binary classification):
true_neg, false_pos, false_neg, true_pos = confusion_matrix(y_oos, y_oos_pred).ravel()
For multiple classes I think the result is closer to what you have, but with the values summed. Like so:
trues = np.diag(cm).sum()
falses = (cm.sum(0) - np.diag(cm)).sum()
Then you can just compute the accuracy with:
ACC = trues / (trues + falses)
Update
From your edited question I can now see that your confusion matrix has 21 total samples, of which 20 were correctly classified. In that case your accuracy is:
$\frac{20}{21} = 0.95238$
This is the value printed by the score method. So you are measuring accuracy; you just aren't reproducing it correctly.
N.B. Sorry for the LaTeX, but hopefully one day Stack Overflow will implement it.
Both are accuracy.
The first one is the overall accuracy: the number of correct predictions divided by the total number of samples (20/21).
The second one is the average of the per-class accuracies, so we add all of these values and divide by 21:
[0.9524 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.9524 1 1 1]
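A small sketch that reproduces both numbers from a confusion matrix shaped like the one in the question (a 21x21 identity matrix, except that the class-0 sample is predicted as class 17):
import numpy as np

cm = np.eye(21, dtype=int)
cm[0, 0] = 0
cm[0, 17] = 1   # the single misclassified sample

overall_acc = np.trace(cm) / cm.sum()    # 20/21 ~ 0.9524, what score() reports
FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)           # equals 21 - (FP+FN+TP) here because there are 21 samples
macro_avg_acc = np.mean((TP + TN) / (TP + FP + FN + TN))   # ~ 0.9955
print(overall_acc, macro_avg_acc)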