Numpy: value substitution according to neighbours

I need to change the values of items in a numpy array based on the values of their neighbours.
More specifically, suppose each item in a numpy array representing an image can take one of just 3 possible values. Suppose my array is the following:
[[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,2,1,1,1],
[1,1,1,2,1,1,1],
[1,1,1,1,1,1,1],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3]
]
What I want is this:
Since the group of contiguous items containing the value 2 in this example is smaller than a (3 x 3) matrix, I need to assign those items the value of their neighbouring items: in this case 1.
The resulting array has to be
[[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,**1**,1,1,1],
[1,1,1,**1**,1,1,1],
[1,1,1,1,1,1,1],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3]
]
What I would like is for the 'spurious' elements (only two cells containing the value 2 in an area where 1 values predominate) to be eliminated and set to the value of the area in which they appear. I hope my explanation is clear. Thanks a lot for any information you can give me.

In image processing these operations are called morphological filtering. In your case you can use an opening.
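import numpy as np
# opening and square can be imported from skimage.morphology directly;
# the skimage.morphology.grey submodule path is unavailable in recent scikit-image releases
from skimage.morphology import opening, square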
a = np.array(
[[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,2,1,1,1],
[1,1,1,2,1,1,1],
[1,1,1,1,1,1,1],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3]
])
opening(a, square(3))
Out:
[[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]]
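As a cross-check that does not depend on scikit-image, the same opening can be sketched with `scipy.ndimage.grey_opening`. This is a swapped-in library, not the one used in the answer above, and the variable name `opened` is only illustrative. Note that an opening removes only small regions that are higher-valued than their surroundings, which is the case for the 2s sitting among 1s here; a spurious value lower than its surroundings would presumably need a closing instead.

```python
import numpy as np
from scipy import ndimage

a = np.array(
    [[1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 2, 1, 1, 1],
     [1, 1, 1, 2, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1],
     [3, 3, 3, 3, 3, 3, 3],
     [3, 3, 3, 3, 3, 3, 3],
     [3, 3, 3, 3, 3, 3, 3]])

# Grey-scale opening: erosion (local minimum) followed by dilation
# (local maximum), both over a 3x3 window.
opened = ndimage.grey_opening(a, size=(3, 3))
print(opened)
```

The two isolated 2s are narrower than the 3x3 window, so the erosion step wipes them out and the dilation step cannot bring them back, while the large band of 3s survives.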

Related

Odd behavior of np.argsort with Pandas

Here is np.argsort applied in four different ways.
import numpy as np
import pandas as pd
print(np.argsort([1,np.nan,3,np.nan, 4]))
print(np.argsort(pd.DataFrame([[1,np.nan,3,np.nan, 4]])).values)
print(np.argsort(pd.Series([1,np.nan,3,np.nan, 4]).values)) # same as first
print(np.argsort(pd.Series([1,np.nan,3,np.nan, 4])).values)
Output:
[0 2 4 1 3]
[[0 2 4 1 3]]
[0 2 4 1 3]
[ 0 -1 1 -1 2]
This is very unexpected behavior. The numpy documentation does not mention it (naturally, since it does not cover Pandas).
In the Pandas documentation you can find
Returns: Series[np.intp]
Positions of values within the sort order with -1 indicating nan values.
Why? What would be a place where we would want this kind of behavior?
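The outputs above are consistent with the following reading, sketched here in plain NumPy and not taken from the pandas source: NumPy sorts NaN to the end, while the Series result looks like an argsort of the non-NaN values only, with -1 written into the NaN slots. The names `mask` and `emulated` are hypothetical, and the Series behaviour itself may differ between pandas versions.

```python
import numpy as np

x = np.array([1, np.nan, 3, np.nan, 4])

# NumPy places NaN last in the sort order; a stable sort keeps
# the two NaNs in their input order.
plain = np.argsort(x, kind="stable")

# Emulation of the Series output shown above: argsort only the
# non-NaN values and leave -1 in the positions that hold NaN.
mask = ~np.isnan(x)
emulated = np.full(x.shape, -1, dtype=np.intp)
emulated[mask] = np.argsort(x[mask], kind="stable")

print(plain)     # [0 2 4 1 3]
print(emulated)  # [ 0 -1  1 -1  2]
```

Under this reading, the -1 entries act as a marker for "this position holds a missing value" and the remaining entries index into the NaN-free values, not into the original array.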

How to overwrite 2-D numpy multi times symmetrically with given index?

I'm trying to change values in a matrix a using the index matrices d and e.
The matrix should always stay symmetrical.
What I came up with is to overwrite the primal matrix at the given indices, make it symmetrical, then do the next overwrite, until all the index matrices have been processed. It's not efficient.
But I'm stuck on how to make it symmetrical.
For example:
a = np.ones([4,4],dtype=object) #the primal matrix (np.object was removed from recent NumPy; the builtin object works)
d = np.array([[1],
[2],
[0],
[0]]) #the first index matrix
a[np.arange(a.shape[0])[:,None],d] = 2 #set the elements at the indices given in the d matrix to 2
Now the result is:
a = np.array([[1 2 1 1]
[1 1 2 1]
[2 1 1 1]
[2 1 1 1]])
Next I want to make it symmetrical: if a[i][j] was selected by the d matrix, a[j][i] should also be changed to 2. This is the part I don't know how to do.
The expected output is:
a = np.array([[1 2 2 2]
[2 1 2 1]
[2 2 1 1]
[2 1 1 1]])
Then comes another overwrite:
e = np.array([[0],[2],[1],[1]])
a[np.arange(a.shape[0])[:,None],e] =3
Now the result is:
a = np.array([[3 2 2 2]
[2 1 3 1]
[2 3 1 1]
[2 3 1 1]])
Then it has to be made symmetrical again (I don't know how to do this part). The final output should be as follows, overwriting values that were previously set to 2 or 1:
a = np.array([[3 2 2 2]
[2 1 3 3]
[2 3 1 1]
[2 3 1 1]])
What should I do to get a symmetrical matrix?
And is there any way to change the primal matrix a directly to get the final result, in a more efficient way?
Thanks in advance!
You can simply swap the first and second indices and apply the same change again; the result will be symmetrical:
a[np.arange(a.shape[0])[:,None], d] = 2
a[d, np.arange(a.shape[0])[:,None]] = 2
output:
[[1 2 2 2]
[2 1 2 1]
[2 2 1 1]
[2 1 1 1]]
Same with any number of other changes:
a[np.arange(a.shape[0])[:,None], e] = 3
a[e, np.arange(a.shape[0])[:,None]] = 3
output:
[[3 2 2 2]
[2 1 3 3]
[2 3 1 1]
[2 3 1 1]]
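The two-line pattern above can be wrapped in a small helper and applied in a loop over any number of index matrices. This is a minimal sketch: the function name `assign_symmetric` is hypothetical, and an integer dtype is assumed here in place of the object dtype used in the question.

```python
import numpy as np

def assign_symmetric(a, idx, value):
    """Write value at (i, idx[i]) and at the mirrored position (idx[i], i)."""
    rows = np.arange(a.shape[0])[:, None]
    a[rows, idx] = value   # the positions selected by the index matrix
    a[idx, rows] = value   # the same positions with the indices swapped
    return a

a = np.ones((4, 4), dtype=int)          # assumption: int dtype for simplicity
d = np.array([[1], [2], [0], [0]])
e = np.array([[0], [2], [1], [1]])

# Later assignments overwrite earlier ones, as in the question.
for idx, value in ((d, 2), (e, 3)):
    assign_symmetric(a, idx, value)

print(a)
```

Because each call writes both (i, j) and (j, i), the matrix stays symmetrical after every step and no separate symmetrization pass is needed.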

finding where 2d list overlaps by value

One numpy 2d-array looks like this:
[[0 1 2]
[1 5 0]]
Another numpy 2d array looks like this:
[[0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2]
[0 1 3 4 8 0 1 3 6 7 8 0 1 2 3 6 8]]
I want to get just the places where they "overlap":
[[0 2]
[1 0]]
without using a for loop
You can use intersect1d.
I called the first array n1 and the second one n2.
The result is not exactly what you expected: intersect1d flattens both arrays, so it returns the values common to both, not the matching columns.
intersection = np.intersect1d(n1, n2)
print(intersection)
[0 1 2]
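If "overlap" is read as "columns of the first array that also occur as columns of the second", which is what the expected output in the question suggests, a loop-free comparison can be sketched with broadcasting. This is an alternative to intersect1d, not part of the answer above; `matches` and `keep` are illustrative names.

```python
import numpy as np

n1 = np.array([[0, 1, 2],
               [1, 5, 0]])
n2 = np.array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
               [0, 1, 3, 4, 8, 0, 1, 3, 6, 7, 8, 0, 1, 2, 3, 6, 8]])

# Compare every column of n1 with every column of n2:
# shapes (2, 3, 1) and (2, 1, 17) broadcast to (2, 3, 17).
# A pair of columns matches only if both rows agree.
matches = (n1[:, :, None] == n2[:, None, :]).all(axis=0)

# Keep the columns of n1 that match at least one column of n2.
keep = matches.any(axis=1)
overlap = n1[:, keep]

print(overlap)
```

The intermediate boolean array has one entry per pair of columns, so this approach trades memory for the absence of a Python loop and may not suit very large inputs.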

How to make fcluster to return the same output as cut_tree?

There are a couple of questions related to this one; the most relevant is this question, I think.
Let's say I have a dataset like this (highly simplified for demonstration purposes):
import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.cluster import hierarchy
val = np.array([[0.20288834, 0.80406494, 4.59921579, 14.28184739],
[0.22477082, 1.43444223, 6.87992605, 12.90299896],
[0.22811485, 0.74509454, 3.85198421, 19.22564266],
[0.20374529, 0.73680174, 3.63178517, 17.82544951],
[0.22722696, 0.86113728, 3.00832186, 16.62306058],
[0.25577882, 0.85671779, 3.70655719, 17.49690061],
[0.23018219, 0.68039151, 2.50815837, 15.09039053],
[0.21638751, 1.12455083, 3.56246872, 18.82866991],
[0.26600895, 1.09415595, 2.85300018, 17.93139433],
[0.22369445, 0.73689845, 3.24919113, 18.60914745]])
df = pd.DataFrame(val, columns=["C{}".format(i) for i in range(val.shape[1])])
C0 C1 C2 C3
0 0.202888 0.804065 4.599216 14.281847
1 0.224771 1.434442 6.879926 12.902999
2 0.228115 0.745095 3.851984 19.225643
3 0.203745 0.736802 3.631785 17.825450
4 0.227227 0.861137 3.008322 16.623061
5 0.255779 0.856718 3.706557 17.496901
6 0.230182 0.680392 2.508158 15.090391
7 0.216388 1.124551 3.562469 18.828670
8 0.266009 1.094156 2.853000 17.931394
9 0.223694 0.736898 3.249191 18.609147
I want to cluster the columns of this dataframe and thereby also specify the number of clusters I obtain. Typically, this can be achieved by using the cut_tree function.
However, currently, cut_tree is broken and therefore I looked for alternatives which led me to the link at the beginning of this post where it is suggested to use fcluster as alternative.
The problem is that I don't see how to specify an exact number of clusters; the maxclust criterion only lets me set a maximum number.
So for my simple example from above I can do:
# target numbers of clusters
n_clusters = range(1, 5)
for n_clust in n_clusters:
    Z = hierarchy.linkage(distance.pdist(df.T.values), method='average', metric='euclidean')
    print("--------\nValues from flcuster:\n{}".format(hierarchy.fcluster(Z, n_clust, criterion='maxclust')))
    print("\nValues from cut_tree:\n{}".format(hierarchy.cut_tree(Z, n_clust).T))
which prints
Values from flcuster:
[1 1 1 1]
Values from cut_tree:
[[0 0 0 0]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 0 1]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 1 2]]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 1 2 3]]
As one can see, fcluster returns 2 distinct clusters at maximum while cut_tree returns the desired number.
Is there a way to get the same output from fcluster until the bug in cut_tree is fixed? If not, is there any other good alternative for this in another package?
Not sure how to get the right number of clusters out of fcluster here.
As an alternative, scikit-learn has AgglomerativeClustering:
from sklearn.cluster import AgglomerativeClustering
# target numbers of clusters
n_clusters = range(1, 5)
for n_clust in n_clusters:
    Z = hierarchy.linkage(distance.pdist(df.T.values), method='average', metric='euclidean')
    print("--------\nValues from flcuster:\n{}".format(hierarchy.fcluster(Z, n_clust, criterion='maxclust')))
    print("\nValues from cut_tree:\n{}".format(hierarchy.cut_tree(Z, n_clust).T))
    # note: newer scikit-learn releases rename the affinity parameter to metric
    print("\nValues from AgglomerativeClustering:\n{}".format(AgglomerativeClustering(n_clusters=n_clust, affinity='euclidean', linkage='average').fit(df.T.values).labels_))
which returns the right number of clusters for the provided dataset (although in different order):
Values from flcuster:
[1 1 1 1]
Values from cut_tree:
[[0 0 0 0]]
Values from AgglomerativeClustering:
[0 0 0 0]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 0 1]]
Values from AgglomerativeClustering:
[0 0 0 1]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 0 1 2]]
Values from AgglomerativeClustering:
[0 0 2 1]
--------
Values from flcuster:
[1 1 1 2]
Values from cut_tree:
[[0 1 2 3]]
Values from AgglomerativeClustering:
[3 1 2 0]
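Since the label numbers differ between methods, it can help to renumber each labelling by order of first appearance before comparing. The sketch below does this in plain NumPy; the helper name `relabel_by_first_appearance` is hypothetical, and the input arrays are copied from the printed output above.

```python
import numpy as np

def relabel_by_first_appearance(labels):
    """Renumber cluster labels 0, 1, 2, ... in order of first occurrence."""
    labels = np.asarray(labels)
    _, first_idx, inverse = np.unique(
        labels, return_index=True, return_inverse=True)
    # Rank the unique labels by the position where each first occurs.
    order = np.argsort(first_idx)
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    return rank[inverse]

# Labels printed above for 3 and 4 target clusters.
cut_tree_3 = relabel_by_first_appearance([0, 0, 1, 2])
agglo_3 = relabel_by_first_appearance([0, 0, 2, 1])
agglo_4 = relabel_by_first_appearance([3, 1, 2, 0])
fcluster_3 = relabel_by_first_appearance([1, 1, 1, 2])

print(cut_tree_3, agglo_3)  # same partition once renumbered
print(agglo_4)
print(fcluster_3)           # still only two clusters
```

After renumbering, the AgglomerativeClustering labels coincide with those from cut_tree, while the fcluster labels still describe a coarser partition.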

Numpy extract values on the diagonal from a matrix

My question is similar to (an expanded version of) this post: Numpy extract row, column and value from a matrix. In that post, I extracted the elements greater than zero from the input matrix; now I want to extract the elements on the diagonal, too. So in this case,
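import numpy as np

# the def line was missing from the post; the name extract is a placeholder
def extract(matrix):
    # np.where returns (row indices, column indices) of the entries > 0
    indices = np.where(matrix > 0)
    # note: the names are swapped relative to np.where's (row, column) order
    index_col, index_row = indices
    dist = matrix[indices]
    return index_row, index_col, dist

m = np.array([[0,2,4],[4,0,0],[5,4,0]])
index_row, index_col, dist = extract(m)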
we could get,
index_row = [1 2 0 0 1]
index_col = [0 0 1 2 2]
dist = [2 4 4 5 4]
and now this is what I want,
index_row = [0 1 2 0 1 0 1 2]
index_col = [0 0 0 1 1 2 2 2]
dist = [0 2 4 4 0 5 4 0]
I tried to change the np.where line in the original code to this,
indices=np.where(matrix>0 & matrix.diagonal)
but got an error.
How can I get the result I want? Please give me some suggestions, thanks!
You can use the following method:
1. Get the mask array.
2. Fill the diagonal of the mask with True.
3. Select the elements where the mask is True.
Here is the code:
m=np.array([[0,2,4],[4,0,0],[5,4,0]])
mask = m > 0
np.fill_diagonal(mask, True)
m[mask]
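To also recover the row and column indices in the form the question asks for, the mask can be passed to np.where. This is a minimal sketch; the function name `extract_with_diagonal` is hypothetical, and the index names are swapped in the same way as in the question's code. The attempt in the question probably fails because `&` binds more tightly than `>` and because `matrix.diagonal` is a method that is never called.

```python
import numpy as np

def extract_with_diagonal(matrix):
    # Boolean mask of the strictly positive entries.
    mask = matrix > 0
    # Force the main diagonal into the selection, whatever its values are.
    np.fill_diagonal(mask, True)
    # np.where yields (row indices, column indices); the question assigns
    # them to index_col and index_row in that order, so the swap is kept.
    index_col, index_row = np.where(mask)
    return index_row, index_col, matrix[mask]

m = np.array([[0, 2, 4], [4, 0, 0], [5, 4, 0]])
index_row, index_col, dist = extract_with_diagonal(m)

print(index_row)  # [0 1 2 0 1 0 1 2]
print(index_col)  # [0 0 0 1 1 2 2 2]
print(dist)       # [0 2 4 4 0 5 4 0]
```

The mask is a fresh array produced by the comparison, so filling its diagonal in place does not modify `m`.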
