Scipy hierarchical clustering appropriate linkage method - python

Apologies because I asked a similar question yesterday but I feel my question lacked content, hopefully now it will be easier to understand.
I have a symmetric matrix with pairwise distances between individuals (see below), and I want to cluster groups of individuals so that all members of a cluster have pairwise distances of zero. I have applied scipy.cluster.hierarchy using different linkage methods and clustering criteria, but I don't get the results I expect. In the example below I would argue that ind5 shouldn't be part of cluster #1 because its distance to ind9 is 1, not 0.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import numpy as np
import pandas as pd
df = pd.read_csv(infile1, sep = '\t', index_col = 0)
print(df)
      ind1  ind2  ind3  ind4  ind5  ind6  ind7  ind8  ind9
ind1     0    29    27     1     2     1     2     1     1
ind2    29     0     2    30    31    29    31    30    30
ind3    27     2     0    28    29    27    29    28    28
ind4     1    30    28     0     0     0     1     2     0
ind5     2    31    29     0     0     0     2     2     1
ind6     1    29    27     0     0     0     1     2     0
ind7     2    31    29     1     2     1     0     3     1
ind8     1    30    28     2     2     2     3     0     2
ind9     1    30    28     0     1     0     1     2     0
X = squareform(df.to_numpy())
print(X)
[29 27 1 2 1 2 1 1 2 30 31 29 31 30 30 28 29 27 29 28 28 0 0 1
2 0 0 2 2 1 1 2 0 3 1 2]
Z = linkage(X, 'single')
print(Z)
[[ 3. 4. 0. 2.]
[ 5. 9. 0. 3.]
[ 8. 10. 0. 4.]
[ 0. 11. 1. 5.]
[ 6. 12. 1. 6.]
[ 7. 13. 1. 7.]
[ 1. 2. 2. 2.]
[14. 15. 27. 9.]]
max_d = 0
clusters = fcluster(Z, max_d, criterion='distance')
sample_list = df.index.to_list()
clust_name_list = clusters.tolist()
result = pd.DataFrame({'Inds': sample_list, 'Clusters': clust_name_list})
print(result)
Inds Clusters
0 ind1 2
1 ind2 5
2 ind3 6
3 ind4 1
4 ind5 1
5 ind6 1
6 ind7 3
7 ind8 4
8 ind9 1
I was hoping that anybody more familiar with these methods could advise whether there is a linkage method that would exclude from the cluster any element (in this case ind5) with distance > 0 to at least one of the other elements in the cluster.
Thanks for your help!
Gonzalo

You can reinterpret your problem as the problem of finding cliques in a graph. The graph is obtained from your distance matrix by treating a distance of 0 as an edge between two nodes. Once you have the graph, you can use networkx (or some other graph theory library) to find its cliques. The cliques are exactly the sets of nodes in which all pairwise distances are 0.
Here is your distance matrix (but note that your distances do not satisfy the triangle inequality):
In [136]: D
Out[136]:
array([[ 0, 29, 27, 1, 2, 1, 2, 1, 1],
[29, 0, 2, 30, 31, 29, 31, 30, 30],
[27, 2, 0, 28, 29, 27, 29, 28, 28],
[ 1, 30, 28, 0, 0, 0, 1, 2, 0],
[ 2, 31, 29, 0, 0, 0, 2, 2, 1],
[ 1, 29, 27, 0, 0, 0, 1, 2, 0],
[ 2, 31, 29, 1, 2, 1, 0, 3, 1],
[ 1, 30, 28, 2, 2, 2, 3, 0, 2],
[ 1, 30, 28, 0, 1, 0, 1, 2, 0]])
Convert the distance matrix to the adjacency matrix A:
In [137]: A = D == 0
In [138]: A.astype(int) # Display as integers for a more compact output.
Out[138]:
array([[1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 1, 0, 1, 0, 0, 1]])
Create a networkx graph G, and find the cliques with nx.find_cliques:
In [139]: import networkx as nx
In [140]: G = nx.Graph(A)
In [141]: cliques = nx.find_cliques(G)
In [142]: list(cliques)
Out[142]: [[0], [1], [2], [3, 5, 8], [3, 5, 4], [6], [7]]
(The values in the lists are the indices; e.g. the clique [2] corresponds to the set of labels ['ind3'].)
Note that there are two nontrivial cliques, [3, 5, 8] and [3, 5, 4], and 3 and 5 occur in both. This is a consequence of your distances having this anomalous data: distance(ind5, ind4) = 0, and distance(ind4, ind9) = 0, but distance(ind5, ind9) = 1 (i.e. the triangle inequality is not satisfied). So, by your definition of a "cluster", there are two possible nontrivial clusters: [ind4, ind6, ind9] or [ind4, ind5, ind6].
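As a quick check of the two cliques, here is a minimal sketch that tests whether every pair in a group is at distance 0. The helper name all_pairs_zero is an illustrative assumption, not part of networkx or SciPy. It also shows why the four-element group produced by single linkage does not meet the "all pairwise distances are 0" requirement:

```python
from itertools import combinations

import numpy as np

D = np.array([[ 0, 29, 27,  1,  2,  1,  2,  1,  1],
              [29,  0,  2, 30, 31, 29, 31, 30, 30],
              [27,  2,  0, 28, 29, 27, 29, 28, 28],
              [ 1, 30, 28,  0,  0,  0,  1,  2,  0],
              [ 2, 31, 29,  0,  0,  0,  2,  2,  1],
              [ 1, 29, 27,  0,  0,  0,  1,  2,  0],
              [ 2, 31, 29,  1,  2,  1,  0,  3,  1],
              [ 1, 30, 28,  2,  2,  2,  3,  0,  2],
              [ 1, 30, 28,  0,  1,  0,  1,  2,  0]])
labels = [f"ind{i + 1}" for i in range(len(D))]


def all_pairs_zero(dist, members):
    """Return True if every pair of indices in `members` has distance 0."""
    return all(dist[i, j] == 0 for i, j in combinations(members, 2))


# The two cliques, followed by the single-linkage cluster from the question.
for group in ([3, 5, 8], [3, 5, 4], [3, 4, 5, 8]):
    names = [labels[i] for i in group]
    print(names, all_pairs_zero(D, group))
```

The first two groups pass the check; the last one fails because distance(ind5, ind9) is 1.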
Finally, note the warning in the networkx documentation: "Finding the largest clique in a graph is NP-complete problem, so most of these algorithms have an exponential running time". If your distance matrix is large, this calculation could take a very long time!

Your solution is correct!
You are getting the following clusters:
cluster 1 with elements ind4, ind5, ind6 and ind9 (each linked to the others through pairs at distance 0; note that ind5 and ind9 themselves are at distance 1).
cluster 2 with element ind1
cluster 3 with element ind7
cluster 4 with element ind8
cluster 5 with element ind2
cluster 6 with element ind3
Only the elements at distance 0 are clustered together in cluster 1, as you require. Clusters 2 to 6 are degenerate clusters, with a single isolated element.
Let's modify the distances so that more proper clusters are created:
X = np.array([ 0, 27, 1, 2, 1, 2, 1, 1,
2, 30, 31, 29, 31, 30, 30,
28, 29, 27, 29, 28, 28,
0, 0, 1, 2, 0,
0, 2, 2, 1,
1, 2, 0,
0, 1,
2])
Z = linkage(X, 'single')
max_d = 0
clusters = fcluster(Z, max_d, criterion='distance')
print("Clusters:", clusters)
for cluster_id in np.unique(clusters):
    members = np.where(clusters == cluster_id)[0]
    print(f"Cluster {cluster_id} has members {members}")
Getting:
Clusters: [2 2 4 3 3 3 1 1 3]
Cluster 1 has members [6 7]
Cluster 2 has members [0 1]
Cluster 3 has members [3 4 5 8]
Cluster 4 has members [2]
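For comparison with the single-linkage runs above, here is a hedged sketch using complete linkage on the original distance matrix from the question. With complete linkage the merge height is the largest pairwise distance between the two merged groups, so cutting at 0 should only keep groups whose members are all at distance 0 from each other. Because the data contains ties, which of the valid groupings you get may depend on the merge order, so no specific grouping is claimed here:

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

D = np.array([[ 0, 29, 27,  1,  2,  1,  2,  1,  1],
              [29,  0,  2, 30, 31, 29, 31, 30, 30],
              [27,  2,  0, 28, 29, 27, 29, 28, 28],
              [ 1, 30, 28,  0,  0,  0,  1,  2,  0],
              [ 2, 31, 29,  0,  0,  0,  2,  2,  1],
              [ 1, 29, 27,  0,  0,  0,  1,  2,  0],
              [ 2, 31, 29,  1,  2,  1,  0,  3,  1],
              [ 1, 30, 28,  2,  2,  2,  3,  0,  2],
              [ 1, 30, 28,  0,  1,  0,  1,  2,  0]])

# Complete linkage: cluster-to-cluster distance is the maximum pairwise distance.
Z = linkage(squareform(D), method="complete")
clusters = fcluster(Z, t=0, criterion="distance")

for cluster_id in np.unique(clusters):
    members = np.where(clusters == cluster_id)[0]
    # Largest distance between any two members (0 for singleton clusters).
    worst = max((D[i, j] for i, j in combinations(members, 2)), default=0)
    print(f"Cluster {cluster_id}: members {members}, max pairwise distance {worst}")
```

Unlike the clique approach, this assigns each individual to exactly one cluster, so ind4 and ind6 end up in only one of the two possible zero-distance groups.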

Related

Is it possible to insert a small matrix in to a big matrix in to a desired location?

I have to insert a small matrix into a big matrix (a zeros matrix). I was trying to do it in a loop, but every time I get the ValueError: could not broadcast input array from shape (6,6) into shape (4,4)
There are two issues:
how to insert it into the zeros matrix (specifying the location in the big zeros matrix).
how to place that matrix starting from the 23rd row of the 40*40 zeros matrix.
import numpy as np
ndofs = 39
k = np.array( [ [ 1, 0, 1, 0, 0, 0 ],
[ 0, 12, 6, 0, -12, 6 ],
[ 0, 6 , 4, 0, -6, 2 ],
[ 1, 0, 0, 1, 0, 0 ],
[ 0, -12, -6, 0, 12, 6 ],
[ 0, 6, 2, 0, -6, 4 ] ] )
K = np.zeros((ndofs+1,ndofs+1))
print(K.shape)
# for each element, changes to global coordinates
for i in range(ndofs):
    K_temp = np.zeros((ndofs+1,ndofs+1))
    K_temp[3*i:3*i+6, 3*i:3*i+6] = k
    K += K_temp
print(K)
You just overwrite the corresponding slice of the bigger array:
import numpy as np
a = np.zeros((50,50))
b = np.ones((10,10))
a[2:12,2:12] = b # insert b with its top-left corner at (2, 2)
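The broadcast error in the question comes from slices that run past the edge of the big array: NumPy clips them, so for i = 12 the target 36:42 becomes 36:40 and the 6x6 block no longer fits into the resulting 4x4 view. A minimal sketch with an explicit bounds check follows; the helper name insert_block is hypothetical, and "the 23rd row" is taken here to mean index 23 (use 22 if you count rows from one):

```python
import numpy as np


def insert_block(big, small, row, col):
    """Copy `small` into `big` with its top-left corner at (row, col)."""
    n_rows, n_cols = small.shape
    fits = (row >= 0 and col >= 0
            and row + n_rows <= big.shape[0]
            and col + n_cols <= big.shape[1])
    if not fits:
        raise ValueError("block does not fit at the requested location")
    big[row:row + n_rows, col:col + n_cols] = small
    return big


K = np.zeros((40, 40))
k = np.arange(36).reshape(6, 6)

# Place the 6x6 block so that it occupies rows 23..28 and columns 23..28.
insert_block(K, k, 23, 23)
print(K[22:30, 22:30])
```

Calling insert_block(K, k, 36, 36) raises the ValueError instead of failing inside the slice assignment.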

Transformation of the 3d numpy array

I have 3d array and I need to set to zero its right part. For each 2d slice (n, :, :) of the array the index of the column should be taken from vector b. This index defines separating point - the left and right parts, as shown in the figure below.
a_before = [[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]
[13 14 15 16]]
[[17 18 19 20]
[21 22 23 24]
[25 26 27 28]
[29 30 31 32]]
[[33 34 35 36]
[37 38 39 40]
[41 42 43 44]
[45 46 47 48]]]
a_before.shape = (3, 4, 4)
b = (2, 3, 1)
a_after_1 = [[[ 1 2 0 0]
[ 5 6 0 0]
[ 9 10 0 0]
[13 14 0 0]]
[[17 18 19 0]
[21 22 23 0]
[25 26 27 0]
[29 30 31 0]]
[[33 0 0 0]
[37 0 0 0]
[41 0 0 0]
[45 0 0 0]]]
After this, for each 2d slice (n, :, :) I have to take index of the column from c vector and multiply by the corresponding value taken from the vector d.
c = (1, 2, 0)
d = (50, 100, 150)
a_after_2 = [[[ 1 100 0 0]
[ 5 300 0 0]
[ 9 500 0 0]
[13 700 0 0]]
[[17 18 1900 0]
[21 22 2300 0]
[25 26 2700 0]
[29 30 3100 0]]
[[4950 0 0 0]
[5550 0 0 0]
[6150 0 0 0]
[6750 0 0 0]]]
I did it but my version looks ugly. Maybe someone can help me.
P.S. I would like to avoid for loops and use only numpy methods.
Thank You.
Here's a version without loops.
In [232]: A = np.arange(1,49).reshape(3,4,4)
In [233]: b = np.array([2,3,1])
In [234]: d = np.array([50,100,150])
In [235]: I,J = np.nonzero(b[:,None]<=np.arange(4))
In [236]: A[I,:,J]=0
In [237]: A[np.arange(3),:,b-1] *= d[:,None]
In [238]: A
Out[238]:
array([[[ 1, 100, 0, 0],
[ 5, 300, 0, 0],
[ 9, 500, 0, 0],
[ 13, 700, 0, 0]],
[[ 17, 18, 1900, 0],
[ 21, 22, 2300, 0],
[ 25, 26, 2700, 0],
[ 29, 30, 3100, 0]],
[[4950, 0, 0, 0],
[5550, 0, 0, 0],
[6150, 0, 0, 0],
[6750, 0, 0, 0]]])
Before I developed this, I wrote an iterative version. It helped me visualize the problem.
In [240]: Ac = np.arange(1,49).reshape(3,4,4)
In [241]: for i,v in enumerate(b):
     ...:     Ac[i,:,v:]=0
     ...:
In [242]: for i,(bi,di) in enumerate(zip(b,d)):
     ...:     Ac[i,:,bi-1]*=di
It may be easier to understand, and in that sense, less ugly!
The fact that your A has a middle dimension that is just "going along for the ride" complicates vectorizing the problem.
With a (3,4) 2d array, the solution is just:
In [251]: Ab = Ac[:,0,:]
In [252]: Ab[b[:,None]<=np.arange(4)]=0
In [253]: Ab[np.arange(3),b-1]*=d
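The same two steps can also be written with broadcast masks instead of fancy-index assignment. This is only a sketch: it assumes c is meant to be used directly (the session above uses b-1, which happens to equal c for this data), and the names keep and scale are illustrative:

```python
import numpy as np

a = np.arange(1, 49).reshape(3, 4, 4)
b = np.array([2, 3, 1])       # per-slice split column: columns >= b become 0
c = np.array([1, 2, 0])       # per-slice column to scale
d = np.array([50, 100, 150])  # per-slice scale factor

cols = np.arange(a.shape[2])

# keep[n, j] is True for the columns of slice n that lie left of b[n].
keep = cols < b[:, None]

# scale[n, j] is d[n] in column c[n] and 1 everywhere else.
scale = np.where(cols == c[:, None], d[:, None], 1)

# Insert a length-1 row axis so both (3, 4) arrays broadcast over the rows.
result = a * keep[:, None, :] * scale[:, None, :]
print(result)
```

This builds a new array rather than modifying a in place, which costs some memory but keeps the original data available.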
Here it is:
import numpy as np
a = np.arange(1,49).reshape(3,4,4)
b = np.array([2,3,1])
c = np.array([1,2,0])
d = np.array([50,100,150])
for i in range(len(b)):
    a[i,:,b[i]:] = 0
for i,j in enumerate(c):
    a[i,:,j] = a[i,:,j]* d[i]
print(a)
#
[[[ 1 100 0 0]
[ 5 300 0 0]
[ 9 500 0 0]
[ 13 700 0 0]]
[[ 17 18 1900 0]
[ 21 22 2300 0]
[ 25 26 2700 0]
[ 29 30 3100 0]]
[[4950 0 0 0]
[5550 0 0 0]
[6150 0 0 0]
[6750 0 0 0]]]

How can I measure distance from a local minimum value in a numpy array?

I'm using skimage.morphology (scikit-image) to do an erosion on a two-dimensional array. I also need to determine the distance from each cell to the minimum value identified in the erosion.
Example:
np.reshape(np.arange(1,126,step=5),[5,5])
array([[ 1, 6, 11, 16, 21],
[ 26, 31, 36, 41, 46],
[ 51, 56, 61, 66, 71],
[ 76, 81, 86, 91, 96],
[101, 106, 111, 116, 121]])
erosion(np.reshape(np.arange(1,126,step=5),[5,5]),selem=disk(3))
array([[ 1, 1, 1, 1, 6],
[ 1, 1, 1, 6, 11],
[ 1, 1, 1, 6, 11],
[ 1, 6, 11, 16, 21],
[26, 31, 36, 41, 46]])
Now what I want to do is also return an array that gives me the distance to the minimum like this:
array([[ 0, 1, 2, 3, 3],
[ 1, 1, 2, 3, 3],
[ 2, 2, 3, 3, 3],
[ 3, 3, 3, 3, 3],
[ 3, 3, 3, 3, 3]])
Is there a scikit tool that can do this? If not, any tips on how to efficiently achieve this result?
You can find the distances from the centre of your footprint using scipy.ndimage.distance_transform_cdt, then use SciPy's ndimage.generic_filter to return those values:
import numpy as np
from skimage.morphology import erosion, disk
from scipy import ndimage as ndi
input_arr = np.reshape(np.arange(1,126,step=5),[5,5])
footprint = disk(3)
def distance_from_min(values, distance_values):
    d = np.inf
    min_val = np.inf
    for i in range(len(values)):
        if values[i] <= min_val:
            min_val = values[i]
            d = distance_values[i]
    return d
full_footprint = np.ones_like(footprint, dtype=float)
full_footprint[tuple(i//2 for i in footprint.shape)] = 0
# use `ndi.distance_transform_edt` instead for the euclidean distance
distance_footprint = ndi.distance_transform_cdt(
    full_footprint, metric='taxicab'
)
# set values outside footprint to 0 for pretty-printing
distance_footprint[~footprint.astype(bool)] = 0
# then, extract it into values matching the values in generic_filter
distance_values = distance_footprint[footprint.astype(bool)]
output = ndi.generic_filter(
    input_arr.astype(float),
    distance_from_min,
    footprint=footprint,
    mode='constant',
    cval=np.inf,
    extra_arguments=(distance_values,),
)
print('input:\n', input_arr)
print('footprint:\n', footprint)
print('distance_footprint:\n', distance_footprint)
print('output:\n', output)
Which gives:
input:
[[ 1 6 11 16 21]
[ 26 31 36 41 46]
[ 51 56 61 66 71]
[ 76 81 86 91 96]
[101 106 111 116 121]]
footprint:
[[0 0 0 1 0 0 0]
[0 1 1 1 1 1 0]
[0 1 1 1 1 1 0]
[1 1 1 1 1 1 1]
[0 1 1 1 1 1 0]
[0 1 1 1 1 1 0]
[0 0 0 1 0 0 0]]
distance_footprint:
[[0 0 0 3 0 0 0]
[0 4 3 2 3 4 0]
[0 3 2 1 2 3 0]
[3 2 1 0 1 2 3]
[0 3 2 1 2 3 0]
[0 4 3 2 3 4 0]
[0 0 0 3 0 0 0]]
output:
[[0. 1. 2. 3. 3.]
[1. 2. 3. 3. 3.]
[2. 3. 4. 4. 4.]
[3. 3. 3. 3. 3.]
[3. 3. 3. 3. 3.]]
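To see what the distance_footprint array holds, here is a NumPy-only sketch that rebuilds it without scikit-image or the distance transform. It assumes the footprint is a disk defined as all offsets within the given radius of the centre, which matches the footprint printed above; for a footprint centred like this, the taxicab distance is just the sum of the absolute row and column offsets:

```python
import numpy as np

radius = 3
offsets = np.arange(-radius, radius + 1)

# dr[i, j] and dc[i, j] are the row and column offsets from the centre cell.
dr, dc = np.meshgrid(offsets, offsets, indexing="ij")

# Boolean disk: True where the offset lies within `radius` of the centre.
footprint = (dr**2 + dc**2) <= radius**2

# Taxicab distance from the centre, zeroed outside the footprint for display.
taxicab = np.abs(dr) + np.abs(dc)
distance_footprint = np.where(footprint, taxicab, 0)

# Flattened distances in the same row-major order generic_filter uses
# for the values it passes to the filter function.
distance_values = distance_footprint[footprint]

print(distance_footprint)
```

Swapping the taxicab line for np.hypot(dr, dc) would give Euclidean distances instead.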
This function will be very slow, however. If you want to make it faster, you will need (a) a solution like Numba or Cython for the filter function, in conjunction with SciPy LowLevelCallables and (b) to hardcode the distance array into the distance function, because for LowLevelCallables it is more difficult to pass in extra arguments. Here is a full example with llc-tools, which you can install with pip install numba llc-tools.
import numpy as np
from scipy import ndimage as ndi
from skimage.morphology import erosion, disk
import llc
def filter_func_from_footprint(footprint):
    # first, create a footprint where the values are the distance from the
    # center
    full_footprint = np.ones_like(footprint, dtype=float)
    full_footprint[tuple(i//2 for i in footprint.shape)] = 0
    # use `ndi.distance_transform_edt` instead for the euclidean distance
    distance_footprint = ndi.distance_transform_cdt(
        full_footprint, metric='taxicab'
    )
    # then, extract it into values matching the values in generic_filter
    distance_footprint[~footprint.astype(bool)] = 0
    distance_values = distance_footprint[footprint.astype(bool)]
    # finally, create a filter function with the values hardcoded
    @llc.jit_filter_function
    def distance_from_min(values):
        d = np.inf
        min_val = np.inf
        for i in range(len(values)):
            if values[i] <= min_val:
                min_val = values[i]
                d = distance_values[i]
        return d
    return distance_from_min
if __name__ == '__main__':
    input_arr = np.reshape(np.arange(1,126,step=5),[5,5])
    footprint = disk(3)
    # footprint passed positionally: the keyword is `selem` in older
    # scikit-image releases and `footprint` in newer ones
    eroded = erosion(input_arr, footprint)
    filter_func = filter_func_from_footprint(footprint)
    result = ndi.generic_filter(
        # use input_arr.astype(float) when using euclidean dist
        input_arr,
        filter_func,
        footprint=disk(3),
        mode='constant',
        cval=np.inf,
    )
    print('input:\n', input_arr)
    print('output:\n', result)
Which gives:
input:
[[ 1 6 11 16 21]
[ 26 31 36 41 46]
[ 51 56 61 66 71]
[ 76 81 86 91 96]
[101 106 111 116 121]]
output:
[[0 1 2 3 3]
[1 2 3 3 3]
[2 3 4 4 4]
[3 3 3 3 3]
[3 3 3 3 3]]
For more reading on low-level callables and llc-tools, in addition to the LowLevelCallable documentation on the SciPy site (linked above, plus links therein), you can read these two blog posts I wrote a few years ago:
SciPy's new LowLevelCallable is a game-changer
Prettier LowLevelCallables with Numba JIT and decorators

NumPy vs SymPy Row operations different?

I cannot understand for the life of me why a row operation with NumPy just clearly leads to the wrong answer. The correct answer is in the SymPy matrix. Can anyone tell me why NumPy is unable to perform the correct calculation? I'm going crazy. Thank you!
# simplex tableau
import numpy as np
import sympy as sp
#NumPy
simplex = np.array([[2,4,3,1,0,0,0, 400],
[4,1,1,0,1,0,0, 200],
[7,4,4,0,0,1,0, 800],
[-3,-4,-2,0,0,0,1, 0]])
simplex[1,:] = simplex[1,:] - (1/4)*simplex[0,:]
print(simplex)
#SymPy
simplex = sp.Matrix([[2,4,3,1,0,0,0, 400],
[4,1,1,0,1,0,0, 200],
[7,4,4,0,0,1,0, 800],
[-3,-4,-2,0,0,0,1, 0]])
simplex[1,:] = simplex[1,:] - (1/4)*simplex[0,:]
simplex
Numpy:
[[ 2 4 3 1 0 0 0 400]
[ 3 0 0 0 1 0 0 100]
[ 7 4 4 0 0 1 0 800]
[ -3 -4 -2 0 0 0 1 0]]
Sympy:
Matrix([
[ 2, 4, 3, 1, 0, 0, 0, 400],
[3.5, 0, 0.25, -0.25, 1, 0, 0, 100.0],
[ 7, 4, 4, 0, 0, 1, 0, 800],
[ -3, -4, -2, 0, 0, 0, 1, 0]])
Your NumPy array has an integer dtype, so it can't hold floating-point numbers; assigning the float result back into it discards the fractional parts. Give it a floating-point dtype:
simplex = np.array(..., dtype=float)
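A small sketch of the difference, using only the first four columns of the two rows involved (the variable names are illustrative):

```python
import numpy as np

row0 = np.array([2, 4, 3, 1])

# Integer array: the float result is truncated when it is stored back.
row1_int = np.array([4, 1, 1, 0])
row1_int[:] = row1_int - (1 / 4) * row0
print(row1_int)    # [3 0 0 0]

# Float array: the fractional parts survive.
row1_float = np.array([4, 1, 1, 0], dtype=float)
row1_float[:] = row1_float - (1 / 4) * row0
print(row1_float)  # [ 3.5   0.    0.25 -0.25]
```

The right-hand side is computed in floating point in both cases; only the assignment into the integer array loses information.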

numpy/pandas: How to convert a series of strings of zeros and ones into a matrix

I have data that arrives in this format:
[
(1, "000010101001010101011101010101110101", "aaa", ... ),
(0, "111101010100101010101110101010111010", "bb", ... ),
(0, "100010110100010101001010101011101010", "ccc", ... ),
(1, "000010101001010101011101010101110101", "ddd", ... ),
(1, "110100010101001010101011101010111101", "eeee", ... ),
...
]
In tuple format, it looks like this:
(Y, X, other_info, ... )
At the end of the day, I need to train a classifier (e.g. sklearn.linear_model.LogisticRegression) using Y and X.
What's the most straightforward way to turn the string of ones and zeros into something like a np.array, so that I can run it through the classifier? Seems like there should be an easy answer here, but I haven't been able to think of/google one.
A few notes:
I'm already using numpy/pandas/sklearn, so anything in those libraries is fair game.
For a lot of what I'm doing, it's convenient to have the other_info columns together in a DataFrame
The strings are pretty long (~20,000 characters each), but the total data frame is not very tall (~500 rows).
Since you asked primarily for a way to convert a string of ones and zeros into a numpy array, I'll offer my solution as follows:
d = '0101010000' * 2000 # create a 20,000 long string of 1s and 0s
d_array = np.frombuffer(d.encode('ascii'), dtype='int8') - 48 # 48 is ASCII '0', 49 is ASCII '1'; originally np.fromstring(d, 'int8'), whose binary mode is deprecated
This compares favourably to @DSM's solution in terms of speed:
In [21]: timeit numpy.fromstring(d, dtype='int8') - 48
10000 loops, best of 3: 35.8 us per loop
In [22]: timeit numpy.fromiter(d, dtype='int', count=20000)
100 loops, best of 3: 8.57 ms per loop
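To go from individual strings to the full feature matrix, the same byte-level trick can be applied to all rows at once. This is a sketch that assumes every string has the same length and contains only the characters '0' and '1'; the short strings and the names rows and X are placeholders for the real data:

```python
import numpy as np

rows = ["0000101", "1111010", "1000101"]

# Join the strings, view the ASCII bytes as uint8, and reshape so that
# each original string becomes one row of the matrix.
raw = np.frombuffer("".join(rows).encode("ascii"), dtype=np.uint8)
X = raw.reshape(len(rows), -1) - ord("0")  # ASCII '0' is 48, '1' is 49

print(X.shape)
print(X)
```

With ~500 rows of ~20,000 characters this produces a 500 x 20,000 array that can be passed to a classifier together with the Y column.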
How about something like this:
Make the dataframe:
In [82]: v = [
....: (1, "000010101001010101011101010101110101", "aaa"),
....: (0, "111101010100101010101110101010111010", "bb"),
....: (0, "100010110100010101001010101011101010", "ccc"),
....: (1, "000010101001010101011101010101110101", "ddd"),
....: (1, "110100010101001010101011101010111101", "eeee"),
....: ]
In [83]:
In [83]: df = pandas.DataFrame(v)
We can use fromiter or array to get an ndarray:
In [84]: d ="000010101001010101011101010101110101"
In [85]: np.fromiter(d, int) # better: np.fromiter(d, int, count=len(d))
Out[85]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
In [86]: np.array(list(d), int)
Out[86]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
There might be a slick vectorized way to do this, but I'd just apply the obvious per-entry function to the values and get on with my day:
In [87]: df[1]
Out[87]:
0 000010101001010101011101010101110101
1 111101010100101010101110101010111010
2 100010110100010101001010101011101010
3 000010101001010101011101010101110101
4 110100010101001010101011101010111101
Name: 1
In [88]: df[1] = df[1].apply(lambda x: np.fromiter(x, int)) # better with count=len(x)
In [89]: df
Out[89]:
0 1 2
0 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 aaa
1 0 [1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 bb
2 0 [1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 ccc
3 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 ddd
4 1 [1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 eeee
In [90]: df[1][0]
Out[90]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
