I have a couple hundred coordinates in 3D space. I need to merge the points that are closer than a given radius and replace each group with the average of its neighbors.
It sounds like a fairly standard problem, but I haven't been able to find a solution so far. The dataset is small enough that I can compute pairwise distances for all the points.
I don't know, maybe some kind of graph analysis / connected-components labelling on the sparse distance matrix?
I don't really need the averaging part, just the clustering (is clustering even the correct term here?)
A toy dataset could be coords = np.random.random(size=(100,2))
Here's what I tried so far using scipy.cluster.hierarchy. It seems to work fine, but I'm open to more suggestions (DBSCAN maybe?)
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fclusterdata

np.random.seed(0)

fig = plt.figure(figsize=(10, 5))
gs = mpl.gridspec.GridSpec(1, 2)
gs.update(wspace=0.01, hspace=0.05)

# Random integer coordinates on a 30x30 grid.
coords = np.random.randint(30, size=(200, 2))
img = np.zeros((30, 30))
img[tuple(coords.T)] = 1  # tuple indexing; indexing with a list of lists is deprecated

ax = plt.subplot(gs[0])
ax.imshow(img, cmap="nipy_spectral")

# Flat clusters: points closer than distance 2 end up in the same cluster.
clusters = fclusterdata(coords, 2, criterion="distance", metric="euclidean")
print(len(np.unique(clusters)))

img[tuple(coords.T)] = clusters
ax = plt.subplot(gs[1])
ax.imshow(img, cmap="nipy_spectral")
plt.show()
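In case I end up needing the averaging after all, here is a minimal sketch that collapses each cluster to the mean of its members (coords and clusters are the variables from the snippet above):
merged = np.array([coords[clusters == label].mean(axis=0)
                   for label in np.unique(clusters)])
print(merged.shape)  # one averaged point per cluster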
Here is a method that uses a KDTree to query neighbors and the networkx module to gather connected components.
from scipy import spatial
import networkx as nx
# Link every point to each neighbor within `cutoff`; the connected
# components of that graph are the clusters.
cutoff = 2
components = nx.connected_components(
    nx.from_edgelist(
        (i, j)
        for i, js in enumerate(
            spatial.KDTree(coords).query_ball_point(coords, cutoff)
        )
        for j in js
    )
)
clusters = {j: i for i, js in enumerate(components) for j in js}
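To get a flat per-point label array out of that dict (just a usage sketch; every point appears in the dict because query_ball_point always includes the point itself among its neighbors):
import numpy as np  # if not already imported above
labels = np.array([clusters[i] for i in range(len(coords))])
print(len(np.unique(labels)), "connected components")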
I want to perform spectral clustering on the 3-circles dataset that I have generated using make_circles, as shown in the figure. All three circles are of different classes.
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

X_small, y_small = make_circles(n_samples=(100, 200), random_state=3,
                                noise=0.07, factor=0.7)
X_large, y_large = make_circles(n_samples=(100, 200), random_state=3,
                                noise=0.07, factor=0.4)
y_large[y_large == 1] = 2

df = pd.DataFrame(np.vstack([X_small, X_large]), columns=['x1', 'x2'])
df['label'] = np.hstack([y_small, y_large])
df.label.value_counts()

sns.scatterplot(data=df, x='x1', y='x2', hue='label', style='label', palette="bright")
Since I can't flag this question as a duplicate (the similar question has no accepted answer), here is a working example of spectral clustering on 3 circles using your code:
X_small, y_small = make_circles(n_samples=(1000, 2000), random_state=3,
                                noise=0.07, factor=0.1)
X_large, y_large = make_circles(n_samples=(1000, 2000), random_state=3,
                                noise=0.07, factor=0.6)
y_large[y_large == 1] = 2

df = pd.DataFrame(np.vstack([X_small, X_large]), columns=['x1', 'x2'])
df['label'] = np.hstack([y_small, y_large])
df.label.value_counts()

sns.scatterplot(data=df, x='x1', y='x2', hue='label', style='label', palette="bright")
Then adapt the slightly modified 3-circles dataset (more samples and more widely spread circles) to the code of this SO answer:
# Build the (n_samples, 2) feature matrix and the true labels.
x1 = np.expand_dims(df['x1'].values, axis=1)
x2 = np.expand_dims(df['x2'].values, axis=1)
X = np.concatenate((x1, x2), axis=1)
y = df['label'].values

from sklearn.cluster import SpectralClustering

# A large gamma narrows the RBF affinity so that only nearby points count as similar.
clustering = SpectralClustering(n_clusters=3, gamma=1000).fit(X)

colors = ['r', 'g', 'b']
colors = np.array([colors[label] for label in clustering.labels_])

# One scatter call per true class (marker), colored by the predicted cluster.
plt.scatter(X[y == 0, 0], X[y == 0, 1], c=colors[y == 0], marker='X')
plt.scatter(X[y == 1, 0], X[y == 1, 1], c=colors[y == 1], marker='o')
plt.scatter(X[y == 2, 0], X[y == 2, 1], c=colors[y == 2], marker='*')
plt.show()
The np.expand_dims(..., axis=1) calls are necessary to create the dimension along which to concatenate features with np.concatenate(): we initially have 1D vectors, and we don't want to concatenate along the existing initial dimension, which is the sample-index dimension. Each plt.scatter() line plots the points of a single true data class (hence the y == y_true index selection) using the associated marker, with the colors indicating the class assigned by the clustering.
Resulting dataset:
Resulting clusters:
Edit: different markers are used to identify the true classes (colors already indicate the clustering classes), as asked by the OP in the comments. Unfortunately we cannot use an array of markers (as we do for colors) to produce the plot in a single line of code, because marker does not accept a list as input (discussed here).
Edit 2: added the motivation for np.expand_dims(..., axis=1) and some explanation of the plt.scatter() lines, as asked by the OP in the comments.
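As a footnote to the np.expand_dims explanation, a tiny shape check (illustrative only):
import numpy as np

v = np.arange(5)                         # shape (5,)
col = np.expand_dims(v, axis=1)          # shape (5, 1): a column vector
X2 = np.concatenate((col, col), axis=1)  # shape (5, 2): two features side by side
print(v.shape, col.shape, X2.shape)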
I have created a list of Shannon entropy values for a pair of multiple-sequence-aligned sequences. While plotting the values I get a simple line plot, and I want to plot a smooth curve over it. Can anyone suggest the right way to process this? Basically I want a smooth curve that touches the tip of every bar and goes to zero where the y-axis value is zero.
Link to image: https://i.stack.imgur.com/SY3jH.png
# Importing the relevant packages.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import make_interp_spline
from Bio import AlignIO
import warnings
warnings.filterwarnings("ignore")

# Function to calculate the Shannon entropy of one MSA column:
# H = -sum[p(x) * log2(p(x))]
def shannon_entropy(list_input):
    unique_aa = set(list_input)
    M = len(list_input)  # number of residues in the column
    entropy_list = []
    for aa in unique_aa:
        n_i = list_input.count(aa)
        P_i = n_i / float(M)
        entropy_i = P_i * (math.log(P_i, 2))
        entropy_list.append(entropy_i)
    sh_entropy = -(sum(entropy_list))
    return sh_entropy
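# Quick sanity check of shannon_entropy (hand-checkable, illustrative values):
# four equally frequent residues -> 2 bits; a fully conserved column -> zero.
print(shannon_entropy(list("ACGT")))  # 2.0
print(shannon_entropy(list("AAAA")))  # -0.0, i.e. zero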
# Importing the MSA (Clustal) file.
align_clustal1 = AlignIO.read("/home/clustal.aln", "clustal")

def shannon_entropy_list_msa(alignment_file):
    # Entropy of every column of the alignment.
    shannon_entropy_list = []
    for col_no in range(len(list(alignment_file[0]))):
        list_input = list(alignment_file[:, col_no])
        shannon_entropy_list.append(shannon_entropy(list_input))
    return shannon_entropy_list

clustal_omega1 = shannon_entropy_list_msa(align_clustal1)
# Plotting the data
plt.figure(figsize=(18,10))
plt.plot(clustal_omega1, 'r')
plt.xlabel('Residue', fontsize=16)
plt.ylabel("Shannon's entropy", fontsize=16)
plt.show()
Edit 1: here is what my graph looks like after implementing the pchip method (monotonic spline output). Link to the pchip output image: https://i.stack.imgur.com/hA3KW.png
One approach would be PCHIP interpolation, which gives a monotonic curve with the required behaviour at zero values on the y-axis.
We can't run your exact code example on our machines because it points to a local Clustal file in your home directory.
Here's a simple working example (output image omitted):
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import pchip

mylist = [10, 0, 0, 0, 0, 9, 9, 0, 0, 0, 11, 11, 11, 0, 0]
mylist_np = np.array(mylist)
samples = np.array(range(len(mylist)))

# Evaluate the monotonic PCHIP interpolant on a dense grid.
xnew = np.linspace(samples.min(), samples.max(), 100)
plt.plot(xnew, pchip(samples, mylist_np)(xnew))
plt.show()
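To apply the same idea to the entropy values from the question, the x-axis is just the residue index. A sketch (assuming clustal_omega1 is the list computed in the question):
xs = np.arange(len(clustal_omega1))
xnew = np.linspace(xs.min(), xs.max(), 10 * len(xs))
plt.figure(figsize=(18, 10))
plt.plot(xnew, pchip(xs, clustal_omega1)(xnew), 'r')
plt.xlabel('Residue', fontsize=16)
plt.ylabel("Shannon's entropy", fontsize=16)
plt.show()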
I tried to generate a uniform distribution of random integers on a given interval (whether it contains its upper limit is unimportant) with Python. I used the following snippet of code to do so and to plot the result:
import numpy as np
import matplotlib.pyplot as plt
from random import randint

propsedPython = np.random.randint(0, 32767, 8388602) % 2048
propsedPythonNoMod = np.random.randint(0, 2048, 8388602)
# Note: np.random.random_integers is deprecated (inclusive upper bound).
propsedPythonNoModIntegers = np.random.random_integers(0, 2048, 8388602)
propsedPythonNoModRandInt = np.empty(8388602)
for i in range(8388602):
    propsedPythonNoModRandInt[i] = randint(0, 2048)
plt.figure(figsize=[16, 10])
plt.title(r'distribution $\rho_{prop}$ of all the Python-simulated proposed indices')
plt.xlabel(r'indices')
plt.ylabel(r'$\rho_{prop}$')
plt.yscale('log')
plt.hist(propsedPython, bins=1000, histtype='step', label=r'np.random.randint(0,32767,8388602)%2048')
plt.hist(propsedPythonNoMod, bins=1000, histtype='step', label=r'np.random.randint(0,2048,8388602)')
plt.hist(propsedPythonNoModIntegers, bins=1000, histtype='step', label=r'np.random.random_integers(0,2048,8388602)')
plt.hist(propsedPythonNoModRandInt, bins=1000, histtype='step', label=r'randint(0,2048) in a loop')
plt.legend(loc=0)
The resulting plot (image omitted) shows spikes in every case. Could somebody point me in the right direction as to why these spikes appear in all the different cases, and/or give me some advice on which routine to use to get uniformly distributed random integers?
Thanks a lot!
Mmm... I used the new NumPy random Generator facility and plotted one bin per integer value, and the graph looks fine to me.
Code
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

N = 1024 * 500
hist = np.zeros(2048, dtype=np.int32)

# Sample integers in [0, 2048) and count occurrences of each value.
q = rng.integers(0, 2048, dtype=np.int32, size=N, endpoint=False)
for k in range(0, N):
    hist[q[k]] += 1

x = np.arange(0, 2048, dtype=np.int32)
fig, ax = plt.subplots()
ax.stem(x, hist, markerfmt=' ')
plt.show()
and the resulting graph (image omitted).
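Two side notes. The spikes in the original plots are most likely a binning artifact rather than a generator problem: spreading 2048 distinct values over 1000 bins means some bins cover two integers and others three. And the counting loop above can be vectorized with np.bincount. A small sketch of both (reusing q from the snippet above):
# 2048 integers into 1000 bins: bins alternate between holding 2 and 3 values.
counts, _ = np.histogram(np.arange(2048), bins=1000)
print(counts.min(), counts.max())  # 2 3

# Vectorized replacement for the counting loop.
hist = np.bincount(q, minlength=2048)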
I have an array with coordinates. I calculated the distances between all points. Now I only want to show the coordinates whose distance to another point is above a certain threshold. How can I do this in Python?
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations
from scipy.spatial.distance import pdist

# Pairwise coordinate differences (x and y are the coordinate arrays).
dx = np.array([b - a for a, b in combinations(x, 2)])
dy = np.array([b - a for a, b in combinations(y, 2)])

# pdist lives in scipy.spatial.distance, not scipy.stats.
all_distances = pdist(np.array(list(zip(x, y))))
all_distances

df3 = all_distances[~(all_distances <= 35)]
df4 = all_distances[~(all_distances <= 40)]
df5 = all_distances[~(all_distances <= 45)]

fig, ax = plt.subplots()
plt.scatter(df3)  # this fails: scatter needs both x and y values
plt.ylabel('dy')
plt.xlabel('dx')
plt.show()
Above you see the points with all distances (scatterplot image omitted), but now I want a scatterplot with only the points that are above a threshold of 35.
Maybe you are looking for something like this:
import numpy as np
from scipy.spatial.distance import pdist

# Renamed from `combinations` to avoid shadowing itertools.combinations.
points = np.array([(1, 2), (3, 4), (5, 8), (10, 12)])
all_distances = pdist(points)
print(all_distances)
print(all_distances[all_distances > 3])  # keep only distances above the threshold
You can do the same with the other arrays too, so something like plt.scatter(dx[all_distances > 35], dy[all_distances > 35]) probably solves your problem.
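Putting it together for the original question, a self-contained sketch with made-up coordinates (x, y, and the threshold of 35 follow the question's naming; pdist returns pairs in the same order as itertools.combinations):
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = rng.random(20) * 100  # made-up example coordinates
y = rng.random(20) * 100

# Pairwise differences and distances, in matching (i < j) pair order.
dx = np.array([b - a for a, b in combinations(x, 2)])
dy = np.array([b - a for a, b in combinations(y, 2)])
all_distances = pdist(np.column_stack([x, y]))

# Keep only the pairs whose distance exceeds the threshold.
mask = all_distances > 35
plt.scatter(dx[mask], dy[mask])
plt.xlabel('dx')
plt.ylabel('dy')
plt.show()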
I need to cluster customer data that contains categorical and numerical features. The numerical features are not on the same scales (age, income, ...). I tried Mclust on the numerical data after scaling it with StandardScaler, but that gave me intersecting groups.
1- Should I normalize differently if the StandardScaler results are not satisfying?
2- What is the best way to cluster with K-Prototypes?
3- Should the clustering method depend on the data distribution?
I use pandas.
This is what I have used:
# K-means clustering: search for K with the elbow method.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance as sci_distance
from sklearn import cluster as sk_cluster

cdata = data  # `data` is the customer DataFrame loaded earlier
K = range(1, 10)
KM = (sk_cluster.KMeans(n_clusters=k).fit(cdata) for k in K)
centroids = (k.cluster_centers_ for k in KM)
D_k = (sci_distance.cdist(cdata, cent, 'euclidean') for cent in centroids)
dist = (np.min(D, axis=1) for D in D_k)
avgWithinSS = [sum(d) / cdata.shape[0] for d in dist]

plt.plot(K, avgWithinSS, 'b*-')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')
plt.show()
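# Side note (not from the original post): scikit-learn already exposes the
# total within-cluster sum of squares as `inertia_`, so the elbow data can
# be computed without cdist; the curve shape is what matters for the elbow.
inertias = [sk_cluster.KMeans(n_clusters=k).fit(cdata).inertia_ for k in K]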
# K-means clustering.
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation  # for clustering
from sklearn.mixture import GaussianMixture  # for GMM clustering
import matplotlib.pyplot as plt  # for graphics
import pandas as pd
import seaborn as sns
# Clustering
def doKmeans(X, nclust=3):
    model = KMeans(nclust)
    model.fit(X)
    clust_labels = model.predict(X)
    cent = model.cluster_centers_
    return (clust_labels, cent)

clust_labels, cent = doKmeans(data, 3)
kmeans = pd.DataFrame(clust_labels)
data.insert((data.shape[1]), 'kmeans', kmeans)
# Plot the clusters obtained using k-means.
fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(data['var1'], data['var2'], c=kmeans[0], s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('var1')
ax.set_ylabel('var2')
plt.colorbar(scatter)
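# For question 2, a minimal, untested sketch of K-Prototypes via the `kmodes`
# package (an assumed dependency: pip install kmodes). The categorical column
# indices below are made-up placeholders, not taken from this dataset.
from kmodes.kprototypes import KPrototypes

Xmixed = data.to_numpy()
kproto = KPrototypes(n_clusters=3, init='Cao', n_init=5)
labels_kp = kproto.fit_predict(Xmixed, categorical=[0, 1])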
You are approaching this the wrong way.
Do not choose an approach just because you manage to get the code to run. That will never give you good results.
First figure out what you need. What is a cluster? What is a clustering (do all points belong to a cluster? probably not; etc.)? What makes a clustering good, and how can you measure that? Only then choose algorithms based on how well they match your requirements.
Otherwise, you will be solving the wrong problem.