Assess clusters stability for each cluster - python

I have clustered some data points twice and obtained four clusters (A=1,B=2,C=3,D=4) for both of them. I want to assess the overall stability of the clustering, but also assess each cluster individually (cluster A for the first result(A1) vs cluster A for the second result(A2), B1 vs B2, C1 vs C2, and D1 vs D2).
For the overall stability, I am using the adjusted Rand index (ARI) and that works fine. However, when I want to assess a single pair, e.g. A1 vs A2, I don't know how to proceed.
The clustering results are the following:
c1 <- c(1, 2, 3, 2, 1, 3, 4, 3, 2, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 2, 3, 2, 3, 2, 1, 3, 4, 4, 4, 4, 3, 2, 3, 2, 3, 1, 3, 2, 1, 2, 3, 4, 3, 2, 1, 4, 3, 2, 2, 2, 3, 4, 3, 3, 3, 2, 1, 1, 1, 2)
c2 <- c(1, 2, 4, 4, 1, 3, 4, 2, 2, 2, 3, 4, 1, 2, 1, 2, 3, 4, 3, 2, 1, 2, 2, 4, 2, 3, 2, 3, 2, 1, 3, 3, 4, 3, 4, 3, 2, 3, 2, 3, 1, 1, 1, 1, 2, 3, 4, 3, 2, 1, 4, 3, 2, 2, 2, 3, 4, 3, 3, 3, 2, 1, 1, 1, 2)
Is there a good strategy for comparing each pair of matching clusters (e.g. A1 vs A2)?
Suggestions that require R or python syntax are accepted.
Thanks in advance!
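One possible way to score each cluster separately is the Jaccard index between the set of points labelled k in the first run and the set labelled k in the second run. The sketch below assumes, as stated in the question, that label k denotes the same cluster in both results; the function name `per_cluster_jaccard` is made up for this illustration.

```python
import numpy as np

c1 = np.array([1, 2, 3, 2, 1, 3, 4, 3, 2, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2,
               1, 2, 3, 4, 2, 3, 2, 3, 2, 1, 3, 4, 4, 4, 4, 3, 2, 3, 2, 3,
               1, 3, 2, 1, 2, 3, 4, 3, 2, 1, 4, 3, 2, 2, 2, 3, 4, 3, 3, 3,
               2, 1, 1, 1, 2])
c2 = np.array([1, 2, 4, 4, 1, 3, 4, 2, 2, 2, 3, 4, 1, 2, 1, 2, 3, 4, 3, 2,
               1, 2, 2, 4, 2, 3, 2, 3, 2, 1, 3, 3, 4, 3, 4, 3, 2, 3, 2, 3,
               1, 1, 1, 1, 2, 3, 4, 3, 2, 1, 4, 3, 2, 2, 2, 3, 4, 3, 3, 3,
               2, 1, 1, 1, 2])


def per_cluster_jaccard(a, b):
    """Jaccard index |A_k & B_k| / |A_k | B_k| for every label k.

    Assumes label k refers to the same cluster in both labelings.
    """
    scores = {}
    for k in np.union1d(a, b):
        in_a, in_b = a == k, b == k          # membership masks for label k
        scores[int(k)] = float((in_a & in_b).sum() / (in_a | in_b).sum())
    return scores


scores = per_cluster_jaccard(c1, c2)
for label, score in scores.items():
    print(f"cluster {label}: Jaccard = {score:.2f}")
```

A score of 1 means the two runs put exactly the same points in that cluster; values near 0 mean almost no overlap. If the labels were not already aligned between runs, they would first have to be matched, which this sketch does not attempt.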

Related

Randomize list without same entry successively

order_list_raw = []
for i in range(1, 73):
    order_list_raw.append(1)
    order_list_raw.append(2)
    order_list_raw.append(3)
How can I create the same list in a randomized order, but without the same entry appearing twice in a row (e.g. "1, 3, 2" is okay but not "1, 1, 3")?
For randomization I would create a new list like this:
order_list = random.sample(order_list_raw, len(order_list_raw))
A solution would be:
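import random

result = []
for i in range(72):
    options = [1, 2, 3]
    try:
        # drop the previous value so it cannot be picked again
        last_item = result[-1]
        options.remove(last_item)
    except IndexError:
        # first iteration: the list is still empty
        pass
    result.append(random.choice(options))
print(result)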
Output:
[1, 3, 2, 1, 2, 3, 1, 2, 3, 2, 1, 3, 2, 3, 2, 1, 2, 1, 2, 3, 1, 2, 1, 3, 1, 2, 3, 2, 3, 2, 3, 2, 1, 2, 3, 1, 2, 3, 2, 1, 2, 1, 3, 2, 3, 2, 3, 2, 1, 2, 3, 2, 3, 1, 3, 2, 1, 3, 1, 3, 1, 3, 1, 2, 3, 2, 1, 3, 1, 2, 1, 3]
Here we simply take our options, check what the last value in the list is and delete that value from the options. Then we take a random value from the left over options, and append it to the list.
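The loop above draws fresh values, so it does not keep the original counts (72 of each entry). If the counts have to be preserved, one option is to draw from the remaining items, weighted by how many of each are left, and start over whenever only the previous value remains. This is a rough sketch, not a tuned algorithm; `shuffle_no_adjacent` and `max_tries` are names invented for this example.

```python
import random
from collections import Counter


def shuffle_no_adjacent(items, max_tries=1000, rng=random):
    """Return a random ordering of items with no equal neighbours.

    Restarts from scratch when it runs into a dead end.
    """
    for _ in range(max_tries):
        remaining = Counter(items)
        result = []
        while remaining:
            # values still available that differ from the previous pick
            candidates = [v for v in remaining if not result or v != result[-1]]
            if not candidates:
                break  # dead end: only the previous value is left
            weights = [remaining[v] for v in candidates]
            choice = rng.choices(candidates, weights=weights)[0]
            result.append(choice)
            remaining[choice] -= 1
            if remaining[choice] == 0:
                del remaining[choice]
        else:
            return result  # every item was placed
    raise RuntimeError("no valid ordering found within max_tries")


order_list_raw = [1, 2, 3] * 72
order_list = shuffle_no_adjacent(order_list_raw)
print(order_list[:12])
```

Weighting by the remaining counts keeps the three values roughly balanced, which makes dead ends less likely than with a uniform choice.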
If you want to generate the input data randomly, you can use this solution. Draws equal to the previous entry are skipped, so the result can be shorter than 73 items.
import random
b=[]
for i in range(0,73):
    x=random.randint(1,10)
    if len(b)==0 or b[-1]!=x:
        b.append(x)
print(b)
Output :
[6, 2, 3, 5, 6, 5, 3, 8, 1, 5, 4, 9, 4, 9, 8, 6, 9, 2, 1, 5, 8, 6, 1, 9, 6, 9, 3, 6, 5, 7, 9, 1, 9, 5, 9, 3, 4, 3, 7, 8, 3, 4, 5, 9, 1, 4, 9, 2, 1, 5, 7, 1, 10, 2, 4, 2, 1, 7, 1, 5, 4, 1, 2]
If your input data is fixed, you can try the solution below instead.
a=[1,1,4]
b=[]
# cycle through a 100 times, skipping any value equal to the previous entry
for j in range(0,100):
    for i in a:
        if len(b)==0 or b[-1]!=i:
            b.append(i)
print(b)
Output :
[1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4]

How can I vectorize this for loop below, where I need to set values to a range I need to round?

I have a np.array q with some values, for example [1,3,5,7], and a np.array z with some values that I need to round and then use as indices into a third array, 'mapping'.
import numpy as np
q = [1,3,5,7]
z = [0,50.3,240.4,252.9,256]
mapping = np.zeros(256)
for i in range(len(q)):
    print(i)
    start, end = int(round(z[i])), int(round(z[i + 1]))
    mapping[start:end] = int(round(q[i]))
print(mapping)
The output here is:
Here's my approach:
repeats = np.diff(list(np.round(z))+ [256]).astype(int)
# repeats = array([ 49, 191, 12, 3])
np.repeat(np.round(q), repeats)
Output:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7])
Note: this only has 255 elements and it's different from your expected output, because, tbh I don't really understand your logic.
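For the `q` and `z` exactly as given in the question, where `z` already ends at 256, a variant of the same `np.repeat` idea that reproduces the loop could look like the sketch below. It assumes `z` is non-decreasing and that `len(z) == len(q) + 1`; the name `edges` is introduced here for illustration.

```python
import numpy as np

q = np.array([1, 3, 5, 7])
z = np.array([0, 50.3, 240.4, 252.9, 256])

# rounded segment boundaries; np.round and Python's round() both round half to even
edges = np.round(z).astype(int)
# length of each segment, one per entry of q
repeats = np.diff(edges)

mapping = np.zeros(256)
# fill the covered range in one assignment instead of one slice per segment
mapping[edges[0]:edges[-1]] = np.repeat(np.round(q), repeats)
print(repeats)
```

Because `z` already contains the final boundary, nothing extra is appended before `np.diff`, so the number of segments matches the number of values in `q`.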

Creating a list from data within another list

I have created a list
a=[1,2,3,4,5]*100
I now need to create another list that will contain the first 8 prime number locations from within a.
I have tried these two lines of code and they didn't work
b=a[2:3:5:7:11:13:17:19]
a[2:3:5:7:11:13:17:19]=b
The output for list a is [1, 2, 3, 4, 5] repeated 100 times, so I need the values at locations 2, 3, 5, 7, 11, 13, 17 and 19 of that output.
a=[1,2,3,4,5]*100
indices = [2,3,5,7,11,13,17,19]
b = []
for i in indices:
    b.append(a[i])
print(b)
You have to access each element individually. b=a[2:3:5:7:11:13:17:19] is not syntactically valid in Python: a slice takes at most three parts (start:stop:step) and cannot list individual indices.
A more Pythonic (and shorter) way to do the same thing is a list comprehension:
indices = [2,3,5,7,11,13,17,19]
b = [a[i] for i in indices]
I would try it like this using list comprehension (beware the test_prime method is not optimized at all):
def test_prime(n):
    if n == 1:
        return False
    elif n == 2:
        return True
    else:
        # trial division by every number below n
        for x in range(2, n):
            if n % x == 0:
                return False
        return True
a=[1,2,3,4,5]*100
# indices of a whose value is prime
b = [item for item in range(len(a)) if test_prime(a[item])]
b = b[0:8]
print(b)
which outputs (note that Python counts from 0, so the first element of a list has index 0 and not 1):
[1, 2, 4, 6, 7, 9, 11, 12]
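If "prime number locations" means that the indices themselves should be prime (2, 3, 5, ...), as in the question, the index list does not have to be typed by hand. Here is a small sketch that generates the first 8 primes by trial division and then picks those positions; `first_primes` is a helper name made up for this example.

```python
def first_primes(count):
    """Return the first `count` prime numbers using trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # only primes up to sqrt(candidate) need to be tested
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


a = [1, 2, 3, 4, 5] * 100
indices = first_primes(8)
b = [a[i] for i in indices]
print(indices)  # [2, 3, 5, 7, 11, 13, 17, 19]
print(b)        # [3, 4, 1, 3, 2, 4, 3, 5]
```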

how to calculate Numpy.prod() that doesn't fit in 32bits

I need to evaluate the product of a big list of integers
n=[3, 2, 6, 5, 1, 5, 5, 5, 3, 1, 2, 1, 6, 2, 4, 3, 5, 6 ,1 ,6, 1, 1, 6, 2, 1, 4, 6, 2, 1, 4, 2, 2, 4, 2, 5, 1, 2, 5, 4, 3, 6, 3, 1, 4, 1, 2, 5, 6, 3, 6]
np.prod(n)
>>>> -2147483648
However the product result should be:
24073471210291200000000
Could you please suggest a way to get around it and maintain high performance of numpy operations?
I can do the product with a for-loop but I thought it would be a slower operation in comparison to numpy.prod()
Thank you very much
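The negative result comes from the fixed-width integer type wrapping around. A 64-bit integer dtype would not help here either, because the product is larger than 2**63. A few possible workarounds are sketched below, assuming Python 3.8+ for `math.prod`; which one is fastest for a given list length is something to measure rather than take from this sketch.

```python
import math

import numpy as np

n = [3, 2, 6, 5, 1, 5, 5, 5, 3, 1, 2, 1, 6, 2, 4, 3, 5, 6, 1, 6, 1, 1, 6, 2,
     1, 4, 6, 2, 1, 4, 2, 2, 4, 2, 5, 1, 2, 5, 4, 3, 6, 3, 1, 4, 1, 2, 5, 6,
     3, 6]

# Exact: plain Python integers have arbitrary precision
exact = math.prod(n)

# Exact, staying in NumPy: object dtype makes NumPy multiply Python ints
exact_np = np.array(n, dtype=object).prod()

# Approximate but fast: float64 has a huge range and about 15-16 significant digits
approx = np.prod(n, dtype=np.float64)

print(exact)
print(exact_np)
print(approx)
```

The object-dtype version runs at Python speed rather than at the speed of native NumPy arithmetic, so the float64 route is the one to consider when an approximate value is acceptable.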

Calculating and plotting count ratios with Pandas

I have multidimensional data in a pandas data frame with one variable indicating class. For example, here is my attempt at a poor man's heatmap scatter plot:
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
nrows=1000
df=pd.DataFrame([[random.random(), random.random()]+[random.randint(0, 1)] for _ in range(nrows)],
                columns=list("ABC"))
bins=np.linspace(0, 1, 20)
df["Abin"]=[bins[i-1] for i in np.digitize(df.A, bins)]
df["Bbin"]=[bins[i-1] for i in np.digitize(df.B, bins)]
# .loc replaces the .ix indexer, which has been removed from pandas
g=df.loc[:,["Abin", "Bbin", "C"]].groupby(["Abin", "Bbin"])
data=g.agg(["sum", "count"])
data.reset_index(inplace=True)
data["classratio"]=data[("C", "sum")]/data[("C","count")]
# the colormap is passed by name
plt.scatter(data.Abin, data.Bbin, c=data.classratio, cmap="RdYlGn_r", marker="s")
I'd like to plot class densities over binned features. For now I use np.digitize for binning and a complicated hand-made density calculation in Python to plot a heatmap.
Surely, this can be done more compactly with Pandas (pivot?)? Do you know a neat way to bin the two features (for example 10 bins on the interval 0...1) and then plot a class density heatmap where color indicates the ratio of 1's to total rows within this 2D-bin?
Yep, it can be done in a very concise way using the built-in cut function:
In [65]:
nrows=1000
df=pd.DataFrame([[random.random(), random.random()]+[random.randint(0, 1)] for _ in range(nrows)],
columns=list("ABC"))
In [66]:
#This does the trick.
pd.crosstab(np.array(pd.cut(df.A, 20)), np.array(pd.cut(df.B, 20))).values
Out[66]:
array([[2, 2, 2, 2, 7, 2, 3, 5, 1, 4, 2, 2, 1, 3, 2, 1, 7, 2, 4, 2],
[1, 2, 4, 2, 0, 3, 3, 3, 1, 1, 2, 1, 4, 3, 2, 1, 1, 2, 2, 1],
[0, 4, 1, 3, 1, 3, 2, 5, 2, 3, 1, 1, 1, 4, 2, 3, 6, 5, 2, 2],
[5, 2, 3, 2, 2, 1, 3, 2, 4, 0, 3, 2, 0, 4, 3, 2, 1, 3, 1, 3],
[2, 2, 4, 1, 3, 2, 2, 4, 1, 4, 3, 5, 5, 2, 3, 3, 0, 2, 4, 0],
[2, 3, 3, 5, 2, 0, 5, 3, 2, 3, 1, 2, 5, 4, 4, 3, 4, 3, 6, 4],
[3, 2, 2, 4, 3, 3, 2, 0, 0, 4, 3, 2, 2, 5, 4, 0, 1, 2, 2, 3],
[0, 0, 4, 4, 3, 2, 4, 6, 4, 2, 0, 5, 2, 2, 1, 3, 4, 4, 3, 2],
[3, 2, 2, 3, 4, 2, 1, 3, 1, 3, 4, 2, 4, 3, 2, 3, 2, 3, 4, 4],
[0, 1, 1, 4, 1, 4, 3, 0, 1, 1, 1, 2, 6, 4, 3, 5, 3, 3, 1, 4],
[2, 2, 4, 1, 3, 4, 1, 2, 1, 3, 3, 3, 1, 2, 1, 5, 2, 1, 4, 3],
[0, 0, 0, 4, 2, 0, 2, 3, 2, 2, 2, 4, 4, 2, 3, 2, 1, 2, 1, 0],
[3, 3, 0, 3, 1, 5, 1, 1, 2, 5, 6, 5, 0, 0, 3, 2, 1, 5, 7, 2],
[3, 3, 2, 1, 2, 2, 2, 2, 4, 0, 1, 3, 3, 1, 5, 6, 1, 3, 2, 2],
[3, 0, 3, 4, 3, 2, 1, 4, 2, 3, 4, 0, 5, 3, 2, 2, 4, 3, 0, 2],
[0, 3, 2, 2, 1, 5, 1, 4, 3, 1, 2, 2, 3, 5, 1, 2, 2, 2, 1, 2],
[1, 3, 2, 1, 1, 4, 4, 3, 2, 2, 5, 5, 1, 0, 1, 0, 4, 3, 3, 2],
[2, 2, 2, 1, 1, 3, 1, 6, 5, 2, 5, 2, 3, 4, 2, 2, 1, 1, 4, 0],
[3, 3, 4, 7, 0, 2, 6, 4, 1, 3, 4, 4, 1, 4, 1, 1, 2, 1, 3, 2],
[3, 6, 3, 4, 1, 3, 1, 3, 3, 1, 6, 2, 2, 2, 1, 1, 4, 4, 0, 4]])
In [67]:
abins=np.linspace(df.A.min(), df.A.max(), 21)
bbins=np.linspace(df.B.min(), df.B.max(), 21)
# .loc replaces the removed .ix indexer; crosstab counts rows per 2D bin by default,
# so aggfunc (which requires a values argument) is dropped
Z=pd.crosstab(np.array(pd.cut(df.loc[df.C==1, 'A'], abins)),
              np.array(pd.cut(df.loc[df.C==1, 'B'], bbins))).div(
  pd.crosstab(np.array(pd.cut(df.A, abins)),
              np.array(pd.cut(df.B, bbins)))).values
Z = np.ma.masked_where(np.isinf(Z),Z)
x=np.linspace(df.A.min(), df.A.max(), 20)
y=np.linspace(df.B.min(), df.B.max(), 20)
X,Y=np.meshgrid(x, y)
plt.contourf(X, Y, Z, vmin=0, vmax=1)
plt.colorbar()
plt.pcolormesh(X, Y, Z, vmin=0, vmax=1)
plt.colorbar()
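Since C only takes the values 0 and 1, the ratio of 1's to total rows in a bin is simply the mean of C in that bin. Under that assumption, a shorter variant is to bin both features with `pd.cut`, group by the two bin columns and unstack the result into a 2D table. This sketch uses 10 bins on the interval 0...1 as asked in the question; the names `edges`, `a_bin`, `b_bin` and `ratio` are chosen for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
nrows = 1000
df = pd.DataFrame({
    "A": rng.random(nrows),
    "B": rng.random(nrows),
    "C": rng.integers(0, 2, nrows),   # class label, 0 or 1
})

edges = np.linspace(0, 1, 11)          # 10 equal-width bins on [0, 1]
a_bin = pd.cut(df["A"], edges, include_lowest=True)
b_bin = pd.cut(df["B"], edges, include_lowest=True)

# mean of a 0/1 column = share of 1's; observed=False keeps empty bins as NaN
ratio = df.groupby([a_bin, b_bin], observed=False)["C"].mean().unstack()
print(ratio.shape)
```

The resulting table has one row per A bin and one column per B bin, so `ratio.to_numpy()` can be handed to a plotting call such as `plt.pcolormesh` with `vmin=0, vmax=1` to draw the heatmap.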
